Abstract

Huber’s gross-errors contamination model considers the class $\mathcal{F}_\varepsilon$ of all noise distributions
$F = (1-\varepsilon)\Phi + \varepsilon H$, with $\Phi$ standard normal, $\varepsilon \in (0,1)$ the contamination fraction, and $H$
the contaminating distribution. A half century ago, Huber evaluated the minimax asymptotic
variance in scalar location estimation,

$$\min_{\psi} \max_{F \in \mathcal{F}_\varepsilon} V(\psi, F) = \frac{1}{I(F^*_\varepsilon)}, \tag{1}$$

where $V(\psi,F)$ denotes the asymptotic variance of the (M)-estimator for location with score
function $\psi$, and $I(F^*_\varepsilon)$ is the minimal Fisher information $\min_{F \in \mathcal{F}_\varepsilon} I(F)$.

We consider the linear regression model $Y = X\theta_0 + W$, $W_i \sim_{\text{iid}} F$, and iid Normal predictors
$X_{i,j}$, working in the high-dimensional-limit asymptotic where the number $n$ of observations and $p$
of variables both grow large, while $n/p \to m \in (1,\infty)$; hence $m$ plays the role of ‘asymptotic
number of observations per parameter estimated’. Let $V_m(\psi, F)$ denote the per-coordinate asymptotic
variance of the (M)-estimator of regression in the $n/p \to m$ regime [EKBBL13, DM13, Kar13].
Then $V_m \neq V$; however $V_m \to V$ as $m \to \infty$.

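In this paper we evaluate the minimax asymptotic variance of the Huber (M)-estimate. The
statistician minimizes over the family $(\psi_\lambda)_{\lambda>0}$ of all tunings of Huber (M)-estimates of regression,
and Nature maximizes over gross-error contaminations $F \in \mathcal{F}_\varepsilon$. Suppose that $I(F^*_\varepsilon)\cdot m > 1$. Then

$$\min_{\lambda} \max_{F \in \mathcal{F}_\varepsilon} V_m(\psi_\lambda, F) = \frac{1}{I(F^*_\varepsilon) - 1/m}. \tag{2}$$

Of course, the RHS of (2) is strictly bigger than the RHS of (1). Strikingly, if $I(F^*_\varepsilon)\cdot m \le 1$, then

$$\min_{\lambda} \max_{F \in \mathcal{F}_\varepsilon} V_m(\psi_\lambda, F) = \infty.$$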

In short, the asymptotic variance of the Huber estimator breaks down at a critical ratio of
observations per parameter. Classically, for the minimax (M)-estimator of location, no such breakdown
occurs [DH83]. However, under this paper’s $n/p \to m$ asymptotic, the breakdown point is where
the Fisher information per parameter equals unity:

$$\varepsilon^* \equiv \varepsilon^*_m(\text{Minimax Huber-(M) Estimate}) = \inf\{\varepsilon : m \cdot I(F^*_\varepsilon) \le 1\}.$$


* Department of Statistics, Stanford University 

^ Department of Electrical Engineering and Department of Statistics, Stanford University 
Dedication. Based on a lecture delivered at a special colloquium honoring the 50th anniversary
of the Seminar für Statistik (SfS) at ETH Zürich, November 25, 2014. The year 2014 was
simultaneously: the 80th birthday year of Peter Huber, the 50th anniversary of his great 1964 paper on
Robust Estimation, and the 50th anniversary of SfS. All of these events are causes for celebration,
and we thank especially Peter Bühlmann, Sara van de Geer, Hansruedi Künsch, Marloes Maathuis,
Nicolai Meinshausen, and indeed everyone at SfS for creating a wonderful commemoration event.
Special congratulations to Peter J. Bickel on receiving his Doctor Honoris Causa from ETH as part
of this celebration!


1 Introduction 

Fifty years ago, Peter Huber published the masterwork [Hub64] in the Annals of Mathematical
Statistics. His paper, ‘Robust Estimation of a Location Parameter’, revealed robust statistics to be
amenable to mathematical analysis, producing a new optimal robust estimator - now called the
Huber (M)-estimator - that has proven practical, elegant and lasting. Richard Olshen once called
Peter’s paper ‘an out-of-the-park, grand-slam home run’.¹

Only 8 years after this initial paper in statistics, Peter delivered the Wald Lectures [Hub73],
a recognition from the profession of the exceptional importance of his oeuvre. While Huber’s 1964 paper
considered the estimation of a scalar location parameter, his Wald Lectures summarized work showing
that much of the framework of the 1964 paper generalized immediately to regression estimation.


1.1 (M)-estimates of Regression 

Consider the traditional linear regression model 


$$Y = X\theta_0 + W, \tag{3}$$

with $Y = (Y_1, \dots, Y_n)^T \in \mathbb{R}^n$ a vector of responses, $X \in \mathbb{R}^{n\times p}$ a known design matrix, $\theta_0 \in \mathbb{R}^p$ a
vector of parameters, and $W \in \mathbb{R}^n$ a random noise vector with i.i.d. components having marginal
distribution $F = F_W$.²

To estimate $\theta_0$ from observed data $(Y, X)$ we use an (M)-estimator. Picking a non-negative even
convex function $\rho : \mathbb{R} \to \mathbb{R}_{\ge 0}$, we solve the optimization problem³

$$\hat\theta(Y; X) \equiv \arg\min_{\theta} \sum_{i=1}^{n} \rho(Y_i - \langle X_i, \theta\rangle). \tag{4}$$


1 Terminology from American baseball. The highest-impact scoring outcome that can ever be delivered by a batsman, 
and not at all frequent. Wikipedia states that over 112 annual World Series, comprising more than 500 games, and ten 
thousand at-bats, this has happened only eighteen times. 

2 With a slight abuse of notation, we also use $W$ to denote a scalar random variable with the same marginal
distribution $F_W$.

3 $X_1, \dots, X_n$ denote the rows of $X$; while $\theta$ denotes a column vector. $\hat\theta$ is chosen arbitrarily if there are multiple
minimizers.










Of course the prescription is broad enough to encompass traditional least squares - $\rho_{LS}(t) = t^2$ -
however, this would not be robust to outliers.⁴ Better choices might include least absolute deviations
- $\rho_{LAD}(t) = |t|$ - and of course the Huber $\rho$ - $\rho_H(t; \lambda) = t^2/2$ for $|t| \le \lambda$ and $\lambda|t| - \lambda^2/2$ otherwise.⁵
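The three losses just mentioned can be written out in a few lines. This is a minimal sketch using only the Python standard library; the function names (`ls_rho`, `lad_rho`, `huber_rho`, `huber_psi`) are illustrative and do not come from the paper. The Huber loss is coded in its piecewise form (quadratic in the middle, linear in the tails), and its derivative is the clipped identity.

```python
def ls_rho(t):
    """Least-squares loss t^2 (not robust: unbounded score)."""
    return t * t


def lad_rho(t):
    """Least-absolute-deviations loss |t|."""
    return abs(t)


def huber_rho(t, lam):
    """Huber loss: t^2/2 for |t| <= lam, lam*|t| - lam^2/2 beyond."""
    a = abs(t)
    if a <= lam:
        return 0.5 * t * t
    return lam * a - 0.5 * lam * lam


def huber_psi(t, lam):
    """Huber score psi = rho': the identity clipped to [-lam, lam]."""
    return max(-lam, min(lam, t))
```

The two branches of `huber_rho` agree at $|t| = \lambda$, so the loss is continuous with a continuous derivative, and the score is bounded by $\lambda$.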


1.2 Fixed p, large n Minimax Robustness 

Consider the random design case where $X_i \sim_{\text{iid}} N(0, I_p)$, and let $\psi = \rho'$ denote the score function
associated to the (M) estimator of interest. Let $n \to \infty$ with $p$ fixed, and consider the per-coordinate
asymptotic variance

$$V_\infty(\psi, F) \equiv \text{a.s.}\lim_{n\to\infty} \frac{n}{p} \cdot \mathrm{Tr}(\mathrm{Var}_F(\hat\theta)).$$

Huber proposed to consider $V_\infty(\psi,F)$ as the payoff function in a game between the statistician and
nature. The two arguments of $V_\infty$ represent the two choices being made here: the statistician is
choosing the estimator, by specifying $\psi$, and ‘nature’ is choosing the error distribution, by specifying
$F = F_W$. The statistician pays out the amount $V_\infty(\psi,F)$ and, planning for all eventualities, wants
to minimize the worst-case payout. The statistician envisions that $F$ might contain a fraction $\varepsilon$ of
‘bad data’, and so assumes that the action space of Nature is the class $\mathcal{F}_\varepsilon$ of all contaminated normal
distributions $F = (1-\varepsilon)\Phi + \varepsilon H$. Here $\Phi$ denotes the standard normal, $\varepsilon \in (0,1)$ the contamination
fraction, and $H$ the contaminating distribution.

For a given choice $\psi$, the maximal payout that can arise is $\max_{F \in \mathcal{F}_\varepsilon} V_\infty(\psi, F)$. Huber proposed
that the statistician should minimize this quantity across $\psi$, thus obtaining the minimax asymptotic
variance and the associated minimax score. He found the least-informative distribution, $F^*_\varepsilon$ - the cdf
$F$ solving $\min_{F \in \mathcal{F}_\varepsilon} I(F)$ with $I$ the Fisher information for location, and Huber obtained the formula

$$\min_{\psi} \max_{F \in \mathcal{F}_\varepsilon} V_\infty(\psi, F) = \frac{1}{I(F^*_\varepsilon)}. \tag{5}$$


He also discovered the minimax-optimal score function, now called the Huber score; it has the form

$$\psi_\lambda(x) = \min(\lambda, \max(-\lambda, x)),$$

for a specific $\lambda = \lambda^*(\varepsilon)$, achieving the minimax. Numerous textbooks cover this material, including
of course [HR09]; see also Section 2.1 below.


1.3 High-Dimensional Asymptotics 

In his Wald lectures [Hub73, Page 802] Peter Huber called attention to the fertile regime beyond the
fixed p, large n asymptotic:

We intend to build an asymptotic theory for $n \to \infty$; but there are several possibilities
for the concomitant behavior of $p$. In particular, with decreasing restrictiveness:

(a) $\limsup p < \infty$

(b) $\lim p^3/n = 0$

(c) $\lim p^2/n = 0$

(d) $\lim p/n = 0$

(e) $\limsup p/n < 1$

(f) $\limsup n - p = \infty$.


4 As can be documented by Frank Hampel’s notions of Influence Curve [Ham74], which shows that least squares has
unbounded influence, and Breakdown Point, which documents that a single bad observation can cause the least squares
solution to misbehave arbitrarily.

5 Other seemingly good choices, like $\rho(t) = -\log(1 + t^2)$, are ruled out by lack of convexity.


Huber also initiated the attack on this hierarchy of new asymptotic settings, addressing cases (b)-(d). 

Though this was 40 years ago, it has taken the profession a while to catch up.⁶ In recent years,
the focus of mathematical statistics research has finally gone beyond the fixed p, large n asymptotic,
to consider regimes (d)-(e), where n and p are both large.⁷

In this paper, we consider a precise version of case (e), which we call the Proportional-Limit
asymptotic PL(m); in this regime $n, p \to \infty$ and $n/p \to m \in (1,\infty)$. Thus $m$ measures the number of
observations per parameter to be estimated. This parameter seems to recur frequently in practitioner
thinking: Huber specifically mentions in his 1972 Wald lectures the advice from crystallographers⁸
to keep⁹ $n/p > 5$.

In this paper the assumption PL(m) will further entail a random Gaussian design, normalized so
for each $n$, $X_i \sim_{\text{iid}} N(0, \frac{1}{n} I_{p\times p})$; and the regression parameter $\theta_0 = \theta_{0,n} \in \mathbb{R}^p$ will be normalized so
that the per-coordinate size $p^{-1}\|\theta_{0,n}\|_2^2 \to_{a.s.} \tau_0$. In this model $E\{X^T X\} = I_{p\times p}$, and so under
standard Gaussian errors $F = \Phi$, the per-coordinate Fisher Information is 1 for every $n$. Because of the
finiteness of the total Fisher Information per coordinate, we are not entitled to expect highly precise
estimation; hence it should be no surprise to find that the MSE $p^{-1}\|\hat\theta_n - \theta_0\|_2^2 \to_{a.s.} \mathrm{AMSE}(\hat\theta, \theta_0) \neq 0$.
(Here and below AMSE stands for asymptotic mean square error.) Consider as performance measure
the per-coordinate asymptotic variance:

$$V_m(\psi, F) \equiv \text{a.s.}\lim_{n\to\infty} \frac{1}{p} \cdot \mathrm{Tr}(\mathrm{Var}_F(\hat\theta_n)).$$

The notation $V_m(\psi, F)$ emphasizes both the dependence of the asymptotic variance on $\psi$ and $F$ as in
the classical case, but also the dependence on $m \in (1,\infty)$. Recent work on (M)-estimates in PL(m)
by [EKBBL13, DM13] shows that $V_m(\psi,F) > V_\infty(\psi,F)$, while $V_m(\psi,F) \to V_\infty(\psi,F)$ as $m \to \infty$.

Here we will carry out the Huber program of evaluating the minimax asymptotic variance of
the Huber estimate - this time for $V_m(\psi, F)$ for $m \in (1,\infty)$, rather than the classical case $V_\infty$. The
statistician minimizes $V_m$ over the family $(\psi_\lambda)_{\lambda>0}$ of all tunings of Huber (M)-estimates of regression,
and Nature maximizes over gross-error contaminations $F \in \mathcal{F}_\varepsilon$.

6 Peter Bloomfield entered this area already in 1974 [Blo74], and Stephen Portnoy in 1984 [Por84]. Soviet-era
mathematicians also began studying the high-dimensional asymptotic in the late 1960’s just when Huber was also
thinking about it; and so Serdobolskii [Ser10] speaks of the Kolmogorov asymptotic, crediting Andrei Kolmogorov with
calculations in the proportional-limit asymptotic already in 1967. Nevertheless, Huber’s 1972 Wald Lectures were
certainly the earliest high-profile venue marking out this asymptotic for future research.

7 A few references here may suffice: [CT07, BRT09, BvdG11, EKBBL13].

8 Huber’s wife Effi Huber-Buser was trained as a crystallographer and in the experience of DLD is an insightful
scientist, knowing quite a lot even about the field of statistics and the statistical profession.

9 In DLD’s first linear models statistics course, based on the classic Daniel and Wood [DW99], the instructor
specifically mentioned $n/p > 10$ as a desirable ratio. It will be clear from the main results of this paper that the
prescription to keep $n/p > 5$ was very good advice indeed.

The classical solution $m = \infty$ plays an important role even in the PL(m) case. Suppose that
Huber’s least-informative distribution $F^*_\varepsilon$ obeys $I(F^*_\varepsilon)\cdot m > 1$. In dimensional analysis $I(F^*_\varepsilon)$ is the
Fisher information per observation, while $m$ is the number of observations per parameter. Hence
this product is the Fisher information per parameter. Suppose that this exceeds 1. Then our main
result (Corollary 5.6) shows that

$$\min_{\lambda} \max_{F \in \mathcal{F}_\varepsilon} V_m(\psi_\lambda, F) = \frac{1}{I(F^*_\varepsilon) - 1/m}. \tag{6}$$

Of course, the RHS of (6) is strictly bigger than the RHS of the classical ($m = \infty$) case (5). As
compared to the classical $m = \infty$ case, when $1 < m < \infty$ the worst-case asymptotic variance is no
longer given by the reciprocal of the worst-case Fisher Information. However, the discrepancy grows
small as $m \to \infty$. Hence, new phenomena emerge in the high-dimensional situation.


1.4 Variance Breakdown 

Suppose now that the minimal Fisher information per parameter does not exceed 1 - i.e. that
$I(F^*_\varepsilon)\cdot m \le 1$. Then our main result additionally states that

$$\min_{\lambda} \max_{F \in \mathcal{F}_\varepsilon} V_m(\psi_\lambda, F) = \infty.$$

In short, the asymptotic variance of the Huber estimator breaks down at a critical ratio $m = m^*(\varepsilon)$
of observations per parameter. Hampel (1968) defined the breakdown point - the minimal fraction
of gross errors that can drive the estimator beyond all bounds. Later, in connection with non-convex
(M) estimators - such as Hampel’s redescending (M)-estimator - the phenomenon of breakdown of
asymptotic variances arose; see [DH83, Section 5.2]. For Huber’s minimax (M)-estimator of classical
location, no such breakdown occurs: for each $\varepsilon \in (0,1)$,

$$\min_{\lambda} \max_{F \in \mathcal{F}_\varepsilon} V_\infty(\psi_\lambda, F) < \infty.$$

Huber, in personal communication, at one time considered this non-breakdown of the asymptotic
variance to be a notable advantage of the Huber estimator in comparison to some other procedures,
such as the Hampel ‘redescending’ score function.

Under this paper’s PL(m) asymptotic, variance breakdown of Huber (M)-estimates indeed occurs.
For a fixed ratio $m$ of observations per parameter, the variance breakdown point is exactly the critical
fraction of contamination $\varepsilon$ where the minimal Fisher Information per parameter drops to 1 or smaller:

$$\varepsilon^* \equiv \varepsilon^*_m(\text{Minimax Huber-(M) Estimate}) = \inf\{\varepsilon : m \cdot I(F^*_\varepsilon) \le 1\}.$$
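The minimax variance formula and the breakdown point can be evaluated numerically. The sketch below is not taken from the paper: it assumes the classical closed form for Huber's least-favorable distribution (Gaussian centre, exponential tails), under which the capping value $k$ solves $2\varphi(k)/k - 2\Phi(-k) = \varepsilon/(1-\varepsilon)$ and $I(F^*_\varepsilon) = (1-\varepsilon)(2\Phi(k)-1)$. All function names are illustrative, and only the standard library is used.

```python
import math

SQRT2 = math.sqrt(2.0)


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def huber_capping(eps):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1-eps) for k by bisection."""
    target = eps / (1.0 - eps)

    def g(k):
        # 2*Phi(-k) equals erfc(k/sqrt(2)); g is strictly decreasing in k.
        return 2.0 * std_normal_pdf(k) / k - math.erfc(k / SQRT2) - target

    lo, hi = 1e-9, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def min_fisher_info(eps):
    """I(F*_eps) = (1 - eps) * (2*Phi(k) - 1), with 2*Phi(k) - 1 = erf(k/sqrt(2))."""
    k = huber_capping(eps)
    return (1.0 - eps) * math.erf(k / SQRT2)


def minimax_variance(eps, m):
    """1 / (I(F*_eps) - 1/m) when m * I(F*_eps) > 1, and +infinity otherwise."""
    info = min_fisher_info(eps)
    if info * m <= 1.0:
        return math.inf
    return 1.0 / (info - 1.0 / m)


def breakdown_point(m):
    """Contamination level at which m * I(F*_eps) falls to 1 (bisection in eps)."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if m * min_fisher_info(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under these assumptions, `minimax_variance(0.05, 2)` comes out near 3.38, `minimax_variance(0.20, 2)` is infinite, and `breakdown_point(2)` is approximately 0.19.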

1.5 Illustration 

As a first deliverable of this paper, consider Figure 1, which displays the minimax asymptotic variance
as a function of the contamination fraction $\varepsilon$ and the degrees of freedom per parameter estimated $m$.
Below the critical curve - $1/m = I(F^*_\varepsilon)$ - we present contours of the minimax asymptotic variance;
in the lower left corner, the asymptotic variance is nearly 1, as it would be in the classical $m = \infty$,
$\varepsilon = 0$ case. The minimax asymptotic variance blows up as we approach the dashdot curve.

[Figure 1 panel title: Contours of $V^*_m(\varepsilon)$ and $I(F^*_\varepsilon)$ (dashdot)]

Figure 1: Minimax asymptotic variance $V^*_m(\varepsilon)$. Each pair $(\varepsilon, m)$ is represented by the $(x, y)$-point
with $x = \varepsilon$ and $y = 1/m$. The resulting parameter space $0 \le \varepsilon, 1/m \le 1$ is divided into two phases
- below and above the critical curve indicated by the dashdot line. Contours of the asymptotic
variance are depicted in the lower phase; they are undefined in the upper phase, where the
asymptotic variance cannot be bounded: $V^*_m = +\infty$. The boundary separating the two phases is
indicated by the dashdot curve, at $1/m = I(F^*_\varepsilon)$.

A second deliverable is provided by Figure 2, which presents contours of the minimax tuning
parameter $\lambda^*(\varepsilon, m)$; this selects the Huber $\psi_\lambda$ that achieves the minimax asymptotic ($V_m$-) variance.
Figure 2 shows how $\lambda^*$ decays towards zero as $(\varepsilon, m)$ approaches the critical curve.

Table 1 gives some specific numerical values of the minimax asymptotic variance $V^*_m(\varepsilon)$. When
$m = 2$, it turns out that the minimax asymptotic variance breaks down at exactly $\varepsilon^* = 0.1924...$;
this is the value of $\varepsilon$ where $I(F^*_\varepsilon) = 1/2$; the dramatic increase in variance as $\varepsilon \to \varepsilon^*$ is plain from
the table.

We conducted a small Monte-Carlo experiment to illustrate these concepts. With $n = 500$ and
$p = 250$, so $m = 2$, we considered the linear model with iid Normal predictors $X_{i,j}$, and contaminated
normal errors $W_i$, where $F_W = G_{\varepsilon,\mu} \equiv (1-\varepsilon)\Phi + \varepsilon H_\mu$, and $H_\mu$ denotes the symmetric Heaviside
CDF, with mass spread equiprobably at $\pm\mu$.

The reader can see in Table 2 that, for small $\varepsilon = 1/20$, even as we make the contamination
increasingly large, by setting $\mu = 100$, the empirical standard error stays bounded, independently
of contamination amplitude $\mu$. However, as $\varepsilon$ approaches the breakdown point $\varepsilon^*_2 = 0.1924...$, the
variance grows considerably as $\mu$ grows large.
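The contaminated-normal error law $G_{\varepsilon,\mu}$ used in this experiment is easy to simulate. The following is a minimal sketch using only the Python standard library; the function name `sample_contaminated_noise` and its signature are illustrative assumptions, not part of the paper, and it covers the noise only, not the regression fit.

```python
import random


def sample_contaminated_noise(n, eps, mu, rng):
    """Draw n iid errors from (1 - eps) * N(0, 1) + eps * H_mu,
    where H_mu puts mass 1/2 at +mu and 1/2 at -mu."""
    draws = []
    for _ in range(n):
        if rng.random() < eps:
            # Gross error: symmetric two-point contamination at +/- mu.
            draws.append(mu if rng.random() < 0.5 else -mu)
        else:
            # Clean observation: standard normal noise.
            draws.append(rng.gauss(0.0, 1.0))
    return draws


# Example: errors for n = 500 observations at eps = 1/20, mu = 100.
noise = sample_contaminated_noise(500, 0.05, 100.0, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the draws reproducible across runs.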
$\varepsilon$                 0.05    0.10    0.15    0.175   0.1875   0.20     0.25
$V^*_2(\varepsilon)$          3.38    5.84    13.9    35.0    136.4    $\infty$   $\infty$


Table 1: Worst-case asymptotic variance of minimax-tuned Huber (M)-estimator, at various levels
of contamination; degrees of freedom per parameter $m = 2$.


$\varepsilon$        $\mu$      $\mathrm{Var}(\hat\theta)^{1/2}$
0.05      2      1.5883
0.05      5      1.8662
0.05      10     1.8801
0.05      20     1.8594
0.05      100    1.8436
0.1875    2      1.9900
0.1875    5      3.5099
0.1875    10     5.5643
0.1875    20     8.7302
0.1875    100    37.8817


Table 2: Empirical Standard Error of minimax-tuned Huber (M)-estimator, at various amplitudes $\mu$
of contamination; degrees of freedom per parameter $m = 2$. Here the amplitude of the contamination
is $\mu$. These empirical data reflect this paper’s theoretical conclusion that for $\varepsilon$ small, variability stays
controlled as $\mu \to \infty$, but as $\varepsilon$ approaches the breakdown point (here 0.1924...), variability grows
very large as $\mu$ increases, even though it will still ultimately stay bounded below the breakdown
point.
[Figure 2 panel title: Contours of $\lambda^*(\varepsilon, m)$]

Figure 2: Minimax $\lambda^*(\varepsilon; m)$. Each pair $(\varepsilon, m)$ is represented by the point $x = \varepsilon$ and $y = 1/m$.
Contours of the minimax $\lambda$ parameter $\lambda^*(\varepsilon; m)$ are depicted in the region below the dashdot curve
at $1/m = I(F^*_\varepsilon)$.


2 Reminders 


2.1 Classical (M) Estimation and minimax asymptotic variance 

Huber (1964) supposed we have real scalar observations $Y_i = \theta_0 + W_i$ where $W_i$ are iid and
symmetrically distributed, so that $P(W > x) = P(W < -x)$. Hence $\theta_0 \in \mathbb{R}$ is the center of symmetry of the
distribution of $Y_i$, and so also the mean, median, etc. He introduced the (M)-estimator as a solution
$\hat\theta$ of

$$\text{(M)} \qquad \min_{\theta} \sum_{i=1}^{n} \rho(Y_i - \theta),$$

where $\rho$ is an even convex function, $\rho(x) = \rho(-x)$, so the score function $\psi = \rho'$ was monotone
nondecreasing. Under additional regularity conditions, he showed that any solution $\hat\theta_n$ obeys

$$\sqrt{n}(\hat\theta_n - \theta_0) \Rightarrow_d N(0, V(\psi, F)), \qquad n \to \infty,$$

where the asymptotic variance is given by

$$V(\psi, F) = \frac{\int \psi^2 \, dF}{\left(\int \psi' \, dF\right)^2}. \tag{7}$$

For further discussion of regularity conditions, see [HR09].






Huber considered the situation where the random variable $W_i$ was distributed roughly as $N(0,1)$,
but is subject to gross-errors contamination. He evaluated

$$v^*(\varepsilon) = \min_{\psi} \max_{F \in \mathcal{F}_\varepsilon} V(\psi, F),$$

and found the following insightful form. Let $I(F) = \int (f'(x))^2 / f(x)\, dx$ denote the Fisher information
for location; the least informative distribution $F^*_\varepsilon$ minimizes this quantity:

$$i^*(\varepsilon) = \min_{F \in \mathcal{F}_\varepsilon} I(F);$$

Huber characterized the minimax asymptotic variance as the reciprocal of the minimal information:

$$v^*(\varepsilon) = \frac{1}{i^*(\varepsilon)},$$

and using this was able to write closed formulas for the optimal shape of $\psi$ - now called the Huber
score function. In the original paper this was denoted

$$\psi_\kappa(x) = \min(\kappa, \max(x, -\kappa))$$

with so-called capping parameter $\kappa$, such that errors larger in absolute value than $\kappa$ get capped.
Huber obtained closed form expressions¹⁰ for the minimax capping parameter $\kappa = \kappa^*(\varepsilon)$, the least
favorable $F = F^*_\varepsilon$, and the minimax asymptotic variance $v^*(\varepsilon) = V(\psi_{\kappa^*(\varepsilon)}, F^*_\varepsilon)$. Figure 3 displays
the behavior of $v^*(\varepsilon)$ and $\kappa^*(\varepsilon)$, as well as $i^*(\varepsilon)$.


2.2 Regularized Score Functions 

Huber’s (M) estimator of regression uses, for some fixed $\lambda > 0$,

$$\rho_\lambda(z) = \begin{cases} z^2/2 & \text{if } |z| \le \lambda, \\ \lambda|z| - \lambda^2/2 & \text{otherwise.} \end{cases} \tag{8}$$

Huber’s $\rho$ is quadratic in the middle, has linear tails, and is continuous with a continuous derivative.
This is straight out of Huber’s theory for the location problem, so no-one should be confused by
the switch from $\kappa$ to $\lambda$ to denote the threshold for transition from quadratic to linear; it simply is
convenient below to use $\lambda$ rather than $\kappa$ in the regression case.

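For the AMP algorithm discussed below, we need the family of regularized $\rho$-functions, where for
each regularization parameter $r > 0$,

$$\rho(z; r) = \min_{x \in \mathbb{R}} \left\{ r\rho_\lambda(x) + \frac{1}{2}(x - z)^2 \right\}. \tag{9}$$

Associated to this is a regularized score function $\Psi(z; r) = \frac{\partial \rho}{\partial z}(z; r)$. [DM13] writes it in terms of Huber’s
original score $\psi_\lambda$:

$$\Psi(z; r) = r\,\psi_\lambda\!\left(\frac{z}{1+r}\right). \tag{10}$$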


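10 For example, $i^*(\varepsilon) = j(\kappa^*(\varepsilon), \varepsilon)$, where $j(\kappa, \varepsilon) = (1-\varepsilon)\int_{-\kappa}^{\kappa} x^2\phi(x)\,dx + \kappa^2 \cdot (\varepsilon + (1-\varepsilon)\cdot 2 \cdot \Phi(-\kappa))$ and $\kappa^*(\varepsilon) =
\arg\min_\kappa J(\kappa, \varepsilon)$.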
[Figure 3 panel titles, from left: Minimax Variance; Minimum Fisher Information; Minimax Capping]

Figure 3: Minimax quantities in the [Hub64] scalar minimax problem, as a function of contamination
fraction $\varepsilon$. From left: $v^*(\varepsilon)$ (semilog plot); $i^*(\varepsilon)$ and $\kappa^*(\varepsilon)$.


In particular the shape of each $\Psi$ is similar to $\psi$, but the slope of the central part is now
$\|\Psi'(\cdot; r)\|_\infty = \frac{r}{1+r} < 1$.

As explained in [DM13], although one uses the Huber $\psi_\lambda$ as the basis of a high-dimensional
regression estimation, the effective score function of that (M)-estimator belongs to the family $\Psi(\cdot; r)$,
for a particular choice of $r$, defined below.
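The regularized $\rho$ and its score can be computed in closed form for the Huber loss. The sketch below solves the inner minimization defining $\rho(z; r)$ explicitly and takes the score as $z$ minus the minimizer; this closed form is derived here from the first-order condition $r\psi_\lambda(x) + x - z = 0$ rather than quoted from the paper, and the function names are illustrative.

```python
def huber_rho(x, lam):
    """Huber loss: x^2/2 for |x| <= lam, lam*|x| - lam^2/2 beyond."""
    a = abs(x)
    return 0.5 * x * x if a <= lam else lam * a - 0.5 * lam * lam


def huber_psi(x, lam):
    """Huber score: the identity clipped to [-lam, lam]."""
    return max(-lam, min(lam, x))


def prox_huber(z, r, lam):
    """Minimizer over x of r*rho_lam(x) + 0.5*(x - z)^2."""
    if abs(z) <= lam * (1.0 + r):
        # Minimizer lies in the quadratic zone: (1 + r) * x = z.
        return z / (1.0 + r)
    # Minimizer lies in a linear tail: x = z - r*lam*sign(z).
    return z - r * lam * (1.0 if z > 0 else -1.0)


def regularized_rho(z, r, lam):
    """rho(z; r): the minimum value of r*rho_lam(x) + 0.5*(x - z)^2."""
    x = prox_huber(z, r, lam)
    return r * huber_rho(x, lam) + 0.5 * (x - z) ** 2


def regularized_score(z, r, lam):
    """Psi(z; r): derivative of regularized_rho in z, equal to z - prox."""
    return z - prox_huber(z, r, lam)
```

With these definitions, `regularized_score(z, r, lam)` coincides with `r * huber_psi(z / (1 + r), lam)`, so its central slope is $r/(1+r)$ and it is capped at $\pm r\lambda$.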

2.3 AMP algorithm 

The approximate message passing (AMP) algorithm we proposed in [DM13] for the optimization
problem (4) is iterative, starting at iteration 0 with an initial estimate $\theta^0 \in \mathbb{R}^p$. At iteration
$t = 0, 1, 2, \dots$ it applies a simple procedure to update its estimate $\theta^t \in \mathbb{R}^p$, producing $\theta^{t+1}$. The
procedure involves three steps at each iteration.

Adjusted residuals. Using the current estimate $\theta^t$, we compute the vector of adjusted residuals
$R^t \in \mathbb{R}^n$,

$$R^t = Y - X\theta^t + \Psi(R^{t-1}; r_{t-1}); \tag{11}$$

where to the ordinary residuals $Y - X\theta^t$ we here add the extra term¹¹ $\Psi(R^{t-1}; r_{t-1})$.


11 Here and below, given $f : \mathbb{R} \to \mathbb{R}$ and $v = (v_1, \dots, v_m)^T \in \mathbb{R}^m$, we define $f(v) \in \mathbb{R}^m$ by applying $f$ coordinate-wise
to $v$, i.e. $f(v) \equiv (f(v_1), \dots, f(v_m))^T$.
Effective Score. We choose a scalar $r_t > 0$, so that the effective score $\Psi(\cdot; r_t)$ has empirical average
slope $p/n \in (0,1)$. Setting $m = m(n) = n/p > 1$, we take any solution¹² (for instance the
smallest solution) to

$$\frac{1}{m} = \frac{1}{n}\sum_{i=1}^{n} \Psi'(R^t_i; r_t). \tag{12}$$

Scoring. We apply the effective score function $\Psi(R^t; r_t)$:

$$\theta^{t+1} = \theta^t + m X^T \Psi(R^t; r_t). \tag{13}$$

We emphasize that the above procedure, although presented as an algorithm, will in fact be used
simply as a tool in proving results about (M)-estimates.

2.4 State evolution description of AMP 

State Evolution (SE) is a formal procedure for computing the operating characteristics of the AMP
iterates $\theta^t$ and $R^t$ for arbitrary fixed $t$, under the PL(m) asymptotic $n, p \to \infty$, $n/p \to m$. The
ideas have been described at length in [DM13]. Namely, for the $t$-th iteration of AMP, consider the
quantity

$$\tau_t^2 \equiv \lim_{n\to\infty} \frac{1}{pm}\|\theta^t - \theta_0\|_2^2 = \frac{1}{m}\mathrm{AMSE}(\theta^t; \theta_0).$$

SE offers a way to calculate $\tau_t$ using $\tau_{t-1}$, and by extension calculating the limiting AMSE $m \lim_{t\to\infty} \tau_t^2$.

At the heart of State Evolution is the effective noise level $\sigma_t = \sqrt{1 + \tau_t^2}$, which changes
iteration by iteration as the statistical properties of the AMP iterates evolve; it reflects the combined
impact on the estimation of a parameter of observational noise $W$ with standard deviation 1 (on
the uncontaminated data) together with estimation noise $\tau$ that ‘leaks’ from the other estimated
parameters.

Also there is the notion of the effective slope: the well-defined value $r = \mathcal{R}(\tau; m, \lambda, F_W)$ giving
the smallest solution $r \ge 0$ to

$$\frac{1}{m} = E\{\Psi'(W + \tau Z; r)\},$$

where $W \sim F_W$, and, independently, $Z \sim N(0,1)$. Informally, $\mathcal{R}$ measures the value of the
regularization parameter $r$ that satisfies the population analog of the AMP empirical average slope condition
(12).

Similarly, define the variance map

$$\mathcal{A}(\tau^2, r; \lambda, F_W) = E\{\Psi^2(W + \tau Z; r)\};$$

$\mathcal{A}$ measures the variance of the resulting effective score. Evidently, for $r > 0$, $0 \le \mathcal{A}(\tau^2, r) \le
(\mathrm{Var}(W) + \tau^2)$.

In the last two displays, the reader can see that extra Gaussian noise of variance $\tau^2$ is being
added to the underlying noise $W$.


12 This equation always admits at least one solution; cf. [DM13, Proposition A.1].
Definition 2.1. State Evolution is an iterative process for computing the sequence of scalars $\{\tau_t^2\}_{t\ge 0}$,
starting from an initial condition $\tau_0^2 \in \mathbb{R}_{\ge 0}$ following the recursion

$$\tau_{t+1}^2 = m \cdot \mathcal{A}(\tau_t^2, \mathcal{R}(\tau_t)) = m \cdot \mathcal{A}(\tau_t^2, \mathcal{R}(\tau_t; m, \lambda, F_W); \lambda, F_W). \tag{14}$$

Defining $\mathcal{T}(\tau^2) = m \cdot \mathcal{A}(\tau^2, \mathcal{R}(\tau))$, we see that the evolution of $\tau_t^2$ follows the iterations of the
map $\mathcal{T}$. In particular, we make these observations:

$\mathcal{T}(0) > 0$,

$\mathcal{T}(\tau^2)$ is a continuous, nondecreasing function of $\tau$,

$\mathcal{T}(\tau^2) < c \cdot \tau^2$ for some $c \in (0,1)$ and all sufficiently large $\tau$.

As a consequence of Theorem 2.2 below, $\mathcal{T}$ has a unique fixed point $\tau_\infty^2$, i.e.

$$\mathcal{T}(\tau_\infty^2) = \tau_\infty^2.$$

It follows from the above properties that this fixed point is stable and attracts $(\tau_t^2)$ from any starting
value. Explicitly, for each initial value $\tau_0^2 \in (0,\infty)$, the sequence defined for $t = 1, 2, \dots$ by $\tau_t^2 =
\mathcal{T}(\tau_{t-1}^2)$ converges to the above fixed point:

$$\tau_t^2 \to \tau_\infty^2 \quad \text{as } t \to \infty.$$
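The recursion can be iterated numerically in the simplest special case. The sketch below makes two assumptions that are not part of the definition above: the noise is uncontaminated Gaussian, $F_W = \Phi$, so that $W + \tau Z \sim N(0, 1+\tau^2)$ and both expectations have closed forms; and the regularized Huber score has the form $\Psi(z; r) = r\psi_\lambda(z/(1+r))$. Function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mean_slope(r, sigma, lam):
    """E Psi'(sigma*G; r), G ~ N(0,1): slope r/(1+r) on {|sigma*G| <= lam*(1+r)}."""
    c = lam * (1.0 + r) / sigma
    return r / (1.0 + r) * math.erf(c / SQRT2)


def mean_square(r, sigma, lam):
    """E Psi(sigma*G; r)^2 = r^2 * E min((s*G)^2, lam^2), with s = sigma/(1+r)."""
    s = sigma / (1.0 + r)
    c = lam / s
    inner = math.erf(c / SQRT2) - 2.0 * c * std_normal_pdf(c)  # E[G^2; |G| <= c]
    tail = math.erfc(c / SQRT2)                                # P(|G| > c)
    return r * r * (s * s * inner + lam * lam * tail)


def effective_slope(tau2, m, lam):
    """Solve mean_slope(r) = 1/m for r by bisection (mean_slope increases in r)."""
    if m <= 1.0:
        raise ValueError("state evolution requires m > 1")
    sigma = math.sqrt(1.0 + tau2)
    lo, hi = 0.0, 1.0
    while mean_slope(hi, sigma, lam) < 1.0 / m:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_slope(mid, sigma, lam) < 1.0 / m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def state_evolution(m, lam, tau2=0.0, iters=200):
    """Iterate tau2 <- m * A(tau2, R(tau)) and return the final tau2."""
    for _ in range(iters):
        sigma = math.sqrt(1.0 + tau2)
        r = effective_slope(tau2, m, lam)
        tau2 = m * mean_square(r, sigma, lam)
    return tau2
```

As a sanity check, a very large threshold makes the Huber estimator behave like least squares: the map reduces to $\tau^2 \mapsto (1+\tau^2)/m$, whose fixed point is $1/(m-1)$, so `m * state_evolution(2.0, 50.0)` is 2 up to rounding.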


2.5 Correctness of State Evolution 

The paper [DM13] considers (M) estimates with strongly convex $\rho$-functions - this excludes the
Huber estimator for technical reasons. In that paper, [DM13, Theorem 3.1] shows that State
Evolution correctly computes the operating characteristics of the AMP algorithm. In particular, the AMP
algorithm has $m \cdot \tau_\infty^2$ for its $t \to \infty$ limiting AMSE in estimating $\theta_0$.

Within the strongly convex setting, [DM13, Theorem 4.1] shows that the AMP algorithm
converges in mean square to the (M)-estimator, which is therefore also described by the fixed point of
State Evolution.

Define the asymptotic variance of the (M)-estimator $\hat\theta$ by

$$\mathrm{AVar}(\hat\theta) = \lim_{n,p\to\infty} \mathrm{Ave}_{i\in[p]} \mathrm{Var}(\hat\theta_i),$$

where $\mathrm{Ave}_{i\in[p]}$ denotes the average across indices $i$. [DM13, Corollary 4.2] shows that the asymptotic
variance of $\hat\theta$ obeys

$$\mathrm{AVar}(\hat\theta) = m\tau_\infty^2. \tag{15}$$

It follows that State Evolution describes not only the operating characteristics of the large-$t$ limit
of the AMP algorithm, but any algorithm for obtaining the (M)-estimate in the PL(m) asymptotic.
So the fixed point of the one-dimensional dynamical system $\tau^2 \mapsto \mathcal{T}(\tau^2)$ is fundamental.

All these results extend to the Huber estimator itself. The companion paper [DM15] proves the
following extension of the results in [DM13].
Theorem 2.2. [DM15] Suppose that $(\tau_\infty, r_\infty)$ solve the two equations

$$\frac{1}{m} = E\{\Psi'_\lambda(W + \tau Z; r)\},$$

$$\tau^2 = m E\{\Psi_\lambda^2(W + \tau Z; r)\}.$$

Then under the PL(m)-limit, the Huber (M)-estimator $\hat\theta = (\hat\theta_i)_i$ obeys:

$$\mathrm{AVar}(\hat\theta) = m \cdot \tau_\infty^2.$$

In particular, this implies that such fixed point is unique.

We note that, at the fixed point $(\tau_\infty, r_\infty)$, we have

$$\mathrm{AVar}(\hat\theta) = \frac{E\{\Psi_\lambda^2(W + \tau Z; r)\}}{\left(E\{\Psi'_\lambda(W + \tau Z; r)\}\right)^2};$$

the expression on the RHS can be written in terms of Huber’s asymptotic variance formula $V$ (7):
it is $V(\Psi_\lambda, F_W \star N(0, \tau_\infty^2))$. In other words, the classical Huber asymptotic variance formula
continues to hold in an extended sense; however, it is evaluated at the effective score function $\Psi_\lambda(\cdot; r)$
with respect to the effective error distribution $F_W \star N(0, \tau_\infty^2)$; see [EKBBL13] for another approach
to this formula.

3 Least-Favorable State Evolution 

In this section we develop an upper bound on the behavior of State Evolution. We first introduce a 
variant of SE, in which A evolves rather than staying fixed. This variant can be conveniently analyzed. 
In a later section, we tie the results obtained for this evolution to the original state evolution. 

3.1 Floating-Threshold State Evolution 

Recall the notion of effective noise level $\sigma_t = \sqrt{1 + \tau_t^2}$ in state evolution, and consider a variant of SE
where the threshold parameter $\lambda_t$ ‘floats’ proportionally to the noise level $\sigma_t$, as follows: $\lambda_t = \kappa \cdot \sigma_t$.
Here $\kappa$ may be viewed as the capping parameter for data which are presumed to be standardized,
and so the floating $\lambda_t$ is actually invariant across iteration - when expressed in multiples $\kappa$ of the
effective noise level.

In an abuse of notation, define $\mathcal{A}$ with a $\kappa$ (rather than $\lambda$) as argument to be the variance map,
based on floating $\lambda$: with $\sigma = \sqrt{1+\tau^2}$,

$$\mathcal{A}(\tau^2, r; \kappa, F_W) = E\{\Psi^2_{\kappa\sigma}(W + \tau Z; r)\}.$$

Compounding the abuse, define $r = \mathcal{R}(\tau; m, \kappa, F_W)$ analogously, so that

$$\frac{1}{m} = E\{\Psi'_{\kappa\sigma}(W + \tau Z; r)\}.$$

Similarly, we define $\mathcal{T}(\tau^2; m, \kappa, F_W) = m \cdot \mathcal{A}(\tau^2, \mathcal{R}(\tau; m, \kappa, F_W))$, without any warning to the reader
that the same symbols are being used as in the earlier state evolution with fixed $\lambda$, while here and
below the appearance of $\kappa$ in the argument always refers to the floating $\lambda$ evolution. For example,
we might write $\tau_\infty^2(m, \kappa, F)$ for the fixed point of a floating-$\lambda$ evolution and $\tau_\infty^2(m, \lambda, F)$ for the (in
general different) fixed point of a fixed-$\lambda$ evolution. As a first justification for this, note that the
fixed points of the two different dynamical systems (fixed- and floating-$\lambda$ dynamical systems) are in
one-one correspondence, via

$$\lambda = \kappa \cdot \sqrt{1 + \tau_\infty^2},$$

i.e.

• The fixed-$\lambda$ fixed point $\tau_\infty^2(\lambda)$ is identical to the floating-$\lambda$ fixed point $\tau_\infty^2(\kappa_\infty(\lambda))$, under the
floating-$\lambda$ parameter $\kappa_\infty(\lambda) = \lambda/\sqrt{1 + \tau_\infty^2(\lambda)}$; while

• The floating-$\lambda$ fixed point $\tau_\infty^2(\kappa)$ is identical to the fixed-$\lambda$ fixed point $\tau_\infty^2(\lambda_\infty(\kappa))$ at parameter
$\lambda_\infty = \kappa \cdot \sqrt{1 + \tau_\infty^2(\kappa)}$.

Setting $\lambda = \lambda_\infty(m, \kappa, F_W)$ and $\kappa = \kappa_\infty(m, \lambda, F_W)$ establishes the correspondence. Hence
characterizing the fixed points of the floating $\lambda$ scheme will also characterize those of the fixed $\lambda$ scheme;
see also Definition 5.1 et seq. below.

3.2 Least-Favorable SE 

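Let $H_\infty$ denote the improper distribution with its probability mass placed evenly on $\{\pm\infty\}$; with
this notation, set $\bar F_\varepsilon = (1-\varepsilon)\Phi + \varepsilon H_\infty$. We now describe an extremal form of floating-threshold
state evolution.


Definition 3.1. Least Favorable State Evolution (LFSE) is an iterative process for computing a
sequence of scalars $\{\bar\tau_t^2\}_{t\ge 0}$, starting from an initial condition $\bar\tau_0^2 \in \mathbb{R}_{\ge 0}$. An instance of LFSE is
determined by $\bar\tau_0^2$ together with fixed positive scalars $m$, $\kappa$ and $\varepsilon$.

At the $t$-th iteration, one needs the $(t-1)$’th result $\bar\tau_{t-1}$ and sets $\bar\sigma_{t-1} = \sqrt{1 + \bar\tau_{t-1}^2}$,

$$\bar r_t = \mathcal{R}(\bar\tau_{t-1}; m, \kappa, \bar F_\varepsilon),$$
$$\bar\tau_t^2 = m \cdot \mathcal{A}(\bar\tau_{t-1}^2, \bar r_t; \kappa, \bar F_\varepsilon).$$

The procedure is then repeated at the next iteration $t + 1$, and so on.

Letting $\Phi_\sigma$ denote the CDF for $N(0, \sigma^2)$, set $\bar F_{\varepsilon,\sigma} = (1-\varepsilon)\Phi_\sigma + \varepsilon H_\infty$, and define an improper
random variable $\bar X_{\varepsilon,\sigma} \sim \bar F_{\varepsilon,\sigma}$, taking infinite values with positive probability. Setting $\sigma = (1 + \tau^2)^{1/2}$,
we have $\bar F_{\varepsilon,\sigma} = \bar F_\varepsilon \star \Phi_\tau$. Definition 3.1, written in terms of the improper random variable $\bar X_{\varepsilon,\bar\sigma_{t-1}}$
and the floating threshold $\bar\lambda_t = \kappa \cdot \bar\sigma_{t-1}$, gives:

$$\frac{1}{m} = E\,\Psi'_{\bar\lambda_t}(\bar X_{\varepsilon,\bar\sigma_{t-1}}; \bar r_t)$$

and

$$\bar\tau_t^2 = m \cdot E\,\Psi^2_{\bar\lambda_t}(\bar X_{\varepsilon,\bar\sigma_{t-1}}; \bar r_t).$$

Although $\bar X_{\varepsilon,\bar\sigma_{t-1}}$ is an improper random variable, these expectations are well defined.¹³ We refer
to instances where state evolution is applied to proper distributions in $\mathcal{F}_\varepsilon$ as proper state evolutions.


13 given the boundedness and differentiability of the underlying Huber $\psi$.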
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Lemma 3.2. (LFSE Dominates.) Consider a given instance (m, to, k,F) of floating-threshold 
state evolution where F E F e . The LFSE instance (m, To, k, e ) dominates this proper state evolution, 
namely: with ft the sequence of LFSE regularizing parameters and r t the sequence of proper SE 
regularizing parameters, 

r t > rt , t = 1,2,..., 

while for t 2 the MSE under LFSE and t( under proper SE, respectively, we have: 

ft>Tt, t = 0,1,2,... 

Figure [4] illustrates the dominance of LFSE; it shows that the corresponding dynamical maps 
obey T >T. 

The proof, given in the appendix, will depend on the following sequence of observations:

Lemma 3.3. Monotonicity in $x$, $r$, and $\lambda$. Let $\Psi_\lambda(x; r)$ denote the regularized score function based on Huber's $\psi_\lambda$. (With $\lambda$ fixed unless stated otherwise.)

1. For each fixed $r \in \mathbb{R}_+$, $|\Psi_\lambda(x; r)|$ is a monotone increasing function of $|x|$;

2. For each fixed $r \in \mathbb{R}_+$, $\Psi'_\lambda(x; r)$ is a monotone nonincreasing function of $|x|$;

3. For each fixed $x \in \mathbb{R}$, $r \mapsto |\Psi_\lambda(x; r)|$ is monotone nondecreasing in $r$; and

4. For each fixed $x \in \mathbb{R}$, $\lambda \mapsto |\Psi_\lambda(x; r)|$ is monotone nondecreasing in $\lambda$.

The proof will also need the following invariances, which are very special to the extremal improper RV's $X_{\varepsilon,\sigma}$ and $X_{\varepsilon,\bar\sigma}$, together with the fact that the proper SE and LFSE use exactly the same $\kappa$ in forming their respective floating $\lambda$'s.

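Lemma 3.4. For $r > 0$ and $t \ge 1$, let $0 < \sigma_{t-1} \le \bar\sigma_{t-1}$, $\lambda_t = \kappa \cdot \sigma_{t-1}$, and $\bar\lambda_t = \kappa \cdot \bar\sigma_{t-1}$. Then

$$E\,\Psi'_{\lambda_t}(X_{\varepsilon,\sigma_{t-1}}; r) = E\,\Psi'_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; r), \tag{16}$$

$$E\,\Psi^2_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; r) = \left(\frac{\bar\sigma_{t-1}}{\sigma_{t-1}}\right)^2 \cdot E\,\Psi^2_{\lambda_t}(X_{\varepsilon,\sigma_{t-1}}; r). \tag{17}$$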


3.3 The envelope functionals A and B 

To make LFSE more transparent, we introduce some helpful notation. In this subsection, we are again in Huber's original location setting, where the same notation also makes the evaluation of $v^*(\varepsilon)$ significantly easier. Suppose that $F$ is a subdistribution, i.e. a CDF on the extended reals, and put

$$A(\psi_\kappa, F) = \int_{-\infty}^{\infty} \psi_\kappa^2(w)\,dF(w) = E_F\,\psi_\kappa^2(W), \tag{18}$$

$$B(\psi_\kappa, F) = \int_{-\infty}^{\infty} \psi'_\kappa(w)\,dF(w) = E_F\,\psi'_\kappa(W), \tag{19}$$

where $W \sim F$. Calculating explicitly for the Huber score function, we can equally well write

$$A(\psi_\kappa, F) = \int_{-\kappa}^{\kappa} w^2\,dF(w) + \kappa^2 \cdot P_F\{|W| > \kappa\}$$
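and

$$B(\psi_\kappa, F) = P_F\{|W| < \kappa\}.$$

Now define the envelope functions $\bar A$ and $\underline{B}$, so that

$$\bar A(\kappa, \varepsilon) = \sup\{A(\psi_\kappa, F) : F \in \mathcal{F}_\varepsilon\}, \tag{20}$$

$$\underline{B}(\kappa, \varepsilon) = \inf\{B(\psi_\kappa, F) : F \in \mathcal{F}_\varepsilon\}. \tag{21}$$

More explicitly, with $\Phi$ denoting the standard normal CDF, and $Z \sim N(0,1)$,

$$\bar A(\kappa, \varepsilon) = (1-\varepsilon)\,A(\psi_\kappa, \Phi) + \varepsilon\kappa^2, \tag{22}$$

$$\underline{B}(\kappa, \varepsilon) = (1-\varepsilon)\,P\{|Z| < \kappa\} = (1-\varepsilon)\,(2\Phi(\kappa) - 1). \tag{23}$$

Define $V(\psi_\kappa, F) = A(\psi_\kappa, F)/B^2(\psi_\kappa, F)$, and correspondingly,

$$\bar V(\kappa, \varepsilon) = \sup\{V(\psi_\kappa, F) : F \in \mathcal{F}_\varepsilon\}.$$

It follows from Huber (1964) that

$$\bar V(\kappa, \varepsilon) = \frac{\bar A(\kappa, \varepsilon)}{\underline{B}^2(\kappa, \varepsilon)}$$

(the inequality LHS $\le$ RHS is obvious) and also that

$$v^*(\varepsilon) = \inf_\kappa \bar V(\kappa, \varepsilon)$$

(the inequality LHS $\le$ RHS again being immediate).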
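The envelope functions above have closed forms that are easy to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes the explicit formulas $\bar A(\kappa,\varepsilon) = (1-\varepsilon)A(\psi_\kappa,\Phi) + \varepsilon\kappa^2$ and $\underline{B}(\kappa,\varepsilon) = (1-\varepsilon)(2\Phi(\kappa)-1)$, with $A(\psi_\kappa,\Phi) = E\min(Z^2,\kappa^2)$ obtained by integration by parts.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def A_gauss(kappa):
    """A(psi_kappa, Phi) = E min(Z^2, kappa^2) for Z ~ N(0, 1)."""
    inner = (2.0 * normal_cdf(kappa) - 1.0) - 2.0 * kappa * normal_pdf(kappa)
    tail = 2.0 * kappa ** 2 * (1.0 - normal_cdf(kappa))
    return inner + tail


def A_upper(kappa, eps):
    """Upper envelope: contamination placed where psi^2 equals kappa^2."""
    return (1.0 - eps) * A_gauss(kappa) + eps * kappa ** 2


def B_lower(kappa, eps):
    """Lower envelope: contamination contributes nothing to E psi'."""
    return (1.0 - eps) * (2.0 * normal_cdf(kappa) - 1.0)


def V_upper(kappa, eps):
    """Worst-case asymptotic variance of the Huber location estimate."""
    return A_upper(kappa, eps) / B_lower(kappa, eps) ** 2
```

For example, `V_upper(1.399, 0.05)` evaluates the worst-case variance at a capping value close to Huber's classical minimax choice for 5% contamination.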


3.4 Explicit Solution of Least Favorable State Evolution 

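We now put Huber's notation from the previous subsection to work, giving explicit formulas for LFSE.

Lemma 3.5. For a given tuple $(m, \varepsilon, \kappa)$ obeying $(1-\varepsilon) > 1/m$, there is a unique positive solution $\bar r(m, \varepsilon, \kappa)$ to

$$\frac{r}{1+r}\,\underline{B}(\kappa \cdot (1+r), \varepsilon) = \frac{1}{m}. \tag{24}$$

Using this notation, we give an explicit characterization of LFSE. Let $\bar\kappa = \bar\kappa(m, \varepsilon, \kappa) = \kappa \cdot (1 + \bar r)$, as in the first argument of $\underline{B}$ in (24).

Lemma 3.6. LFSE with parameters $(m, \varepsilon, \kappa)$ satisfies, with $\bar\kappa = \bar\kappa(m, \varepsilon, \kappa)$,

$$\bar{\mathcal{T}}(\tau^2; m, \varepsilon, \kappa) = (1 + \tau^2) \cdot \bar V(\bar\kappa, \varepsilon)/m,$$

and, if $\bar V(\bar\kappa, \varepsilon) < m$, LFSE has the unique stable fixed point

$$\bar\tau_\infty^2(m, \varepsilon, \kappa) = \frac{\bar V(\bar\kappa, \varepsilon)/m}{1 - \bar V(\bar\kappa, \varepsilon)/m}.$$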
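Lemmas 3.5 and 3.6 can be turned into a short computation: solve (24) for $\bar r$, form $\bar\kappa = \kappa(1+\bar r)$, and read off the fixed point. The sketch below is illustrative only; the function names are hypothetical, the root is found by plain bisection (relying on the left side of (24) being increasing in $r$), and the envelope formulas are those of Section 3.3.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def B_lower(kappa, eps):
    return (1.0 - eps) * (2.0 * normal_cdf(kappa) - 1.0)


def V_upper(kappa, eps):
    gauss = ((2.0 * normal_cdf(kappa) - 1.0) - 2.0 * kappa * normal_pdf(kappa)
             + 2.0 * kappa ** 2 * (1.0 - normal_cdf(kappa)))
    a_up = (1.0 - eps) * gauss + eps * kappa ** 2
    return a_up / B_lower(kappa, eps) ** 2


def residual_24(r, m, eps, kappa):
    """Left side minus right side of equation (24)."""
    return r / (1.0 + r) * B_lower(kappa * (1.0 + r), eps) - 1.0 / m


def solve_r_bar(m, eps, kappa):
    """Bisection for the positive root of (24); needs (1 - eps) > 1/m."""
    if (1.0 - eps) <= 1.0 / m:
        raise ValueError("equation (24) needs (1 - eps) > 1/m")
    lo, hi = 0.0, 1.0
    while residual_24(hi, m, eps, kappa) < 0.0:  # expand until bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual_24(mid, m, eps, kappa) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lfse_fixed_point(m, eps, kappa):
    """Fixed point of the affine LFSE map, or infinity if none exists."""
    r_bar = solve_r_bar(m, eps, kappa)
    v = V_upper(kappa * (1.0 + r_bar), eps)
    if v >= m:
        return math.inf
    return (v / m) / (1.0 - v / m)
```

Calling `lfse_fixed_point(5.0, 0.05, 1.0)` uses the parameter values of Figure 4 with one illustrative choice of $\kappa$.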
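To prove this, consider a seemingly different evolution, which we call double-bar evolution: with $\bar r$ and $\bar\kappa$ as introduced above, define

$$\bar{\bar{\mathcal{A}}}(\tau^2; m, \varepsilon, \kappa) = (1 + \tau^2) \cdot \bar V(\bar\kappa, \varepsilon)/m^2$$

and

$$\bar{\bar{\mathcal{T}}}(\tau^2) = m \cdot \bar{\bar{\mathcal{A}}}(\tau^2). \tag{25}$$

With $m$, $\varepsilon$, $\kappa$, and thus $\bar r$ and $\bar\kappa$, fixed, define a sequence $\bar{\bar\tau}_t^2$ for $t = 0, 1, 2, \ldots$. At iteration $t = 0$, we pick a starting value $\bar{\bar\tau}_0 \ge 0$; we then proceed inductively, setting all later iterates by

$$\bar{\bar\tau}_t^2 = \bar{\bar{\mathcal{T}}}(\bar{\bar\tau}_{t-1}^2), \qquad t = 1, 2, \ldots$$

Now (25) sets up the dynamical system $\tau^2 \mapsto \bar{\bar{\mathcal{T}}}(\tau^2)$ as an affine dynamical system (in the variable $\tau^2$). Its fixed point (if it exists at all) must obey

$$\bar{\bar\tau}_\infty^2 = (1 + \bar{\bar\tau}_\infty^2) \cdot \bar V(\bar\kappa, \varepsilon)/m.$$

So double-bar evolution has the following explicit solution:

Lemma 3.7. Consider the double-bar evolution introduced in this section, with parameters $(m, \varepsilon, \kappa)$. If $\bar V(\bar\kappa, \varepsilon) < m$, it has the unique stable fixed point

$$\bar{\bar\tau}_\infty^2(m, \varepsilon, \kappa) = \frac{\bar V(\bar\kappa, \varepsilon)/m}{1 - \bar V(\bar\kappa, \varepsilon)/m}.$$

Otherwise there is no fixed point, and successive iterates run off to infinity.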
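The dichotomy in Lemma 3.7 is a property of any affine map $\tau^2 \mapsto (1+\tau^2)\,c$ with $c = \bar V/m$: the iterates converge to $c/(1-c)$ when $c < 1$ and grow without bound when $c \ge 1$. A minimal sketch follows; the function name and the sample values of $c$ are illustrative assumptions, not values taken from the paper.

```python
def iterate_affine_map(c, tau2_start=0.0, steps=200):
    """Iterate tau2 <- (1 + tau2) * c, where c plays the role of V/m."""
    tau2 = tau2_start
    for _ in range(steps):
        tau2 = (1.0 + tau2) * c
    return tau2


def closed_form_fixed_point(c):
    """Fixed point c / (1 - c) of the affine map; requires c < 1."""
    if c >= 1.0:
        raise ValueError("no fixed point when c >= 1")
    return c / (1.0 - c)
```

With `c = 0.25` the iterates settle at `closed_form_fixed_point(0.25)`, while with `c = 1.2` the same loop produces ever larger values, matching the 'run off to infinity' case.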

In fact, double-bar evolution is really just LFSE in disguise. Results of the next subsection will prove:

Lemma 3.8. With $(\bar r_t)$ and $(\bar\tau_t)$ defined by the procedure of Section 3.2, and $(\bar{\bar r}_t)$ and $(\bar{\bar\tau}_t)$ defined by the procedure of this section, each initialized identically, $\bar\tau_0 = \bar{\bar\tau}_0$, we have

$$\bar r_t = \bar{\bar r}_t, \qquad t = 0, 1, 2, \ldots,$$
$$\bar\tau_t = \bar{\bar\tau}_t, \qquad t = 0, 1, 2, \ldots,$$

and

$$\bar{\mathcal{T}}(\cdot) = \bar{\bar{\mathcal{T}}}(\cdot).$$

Lemma 3.6 then follows from the last two lemmas. In turn, Lemma 3.8 follows immediately from the following:

Lemma 3.9.

$$\sup_{F \in \mathcal{F}_\varepsilon} \mathcal{R}(\tau; m, \kappa, F) = \bar r(m, \varepsilon, \kappa), \tag{26}$$

$$\sup_{F \in \mathcal{F}_\varepsilon} \mathcal{A}(\tau^2, \mathcal{R}(\tau; m, \kappa, F); m, \kappa, F) = \bar{\bar{\mathcal{A}}}(\tau^2; m, \varepsilon, \kappa). \tag{27}$$

This shows that the affine evolution (25) indeed implements LFSE, and proves Lemma 3.8.

The proof of Lemma 3.9 is given in the Appendix; it depends on terminology and results of the next subsection.
Figure 4: MSE maps of proper state evolutions and of LFSE. Here $\varepsilon = 0.05$, $m = 5$, and $\mu = 2, 5, 7.5, 10$. The variance map of LFSE is the green straight line, which lies above the variance maps of all the proper SE's, depicted by red curves. Correspondingly, its fixed point is also higher.


3.5 Bounds for $\mathcal{A}$

The quantity $\mathcal{A}$ occurring in LFSE is defined using moments of $\Psi^2$; however, Section 3.4 defines $\bar{\bar{\mathcal{A}}}$ in terms of $\bar A$, which uses moments of $\psi^2$. To explain the connection, and prove Lemma 3.8, we need to relate the two kinds of moments.

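Indeed, (10) says that

$$\Psi_\lambda(x; r) = \frac{r}{1+r}\,\psi_{\lambda(1+r)}(x),$$

and we also have, for any random variable $X$ and any $c > 0$,

$$E\,\psi_\kappa^2(cX) = c^2\,E\,\psi_{\kappa/c}^2(X),$$

while

$$E\,\psi'_\kappa(cX) = E\,\psi'_{\kappa/c}(X).$$

Furthermore, supposing that $W$ has distribution $(1-\varepsilon)\Phi + \varepsilon H$ and that $Z \sim \Phi$ while $U \sim H$, then

$$E|\Psi(W + \tau Z; r)|^2 = (1-\varepsilon)\,E|\Psi(\sqrt{1+\tau^2} \cdot Z; r)|^2 + \varepsilon \cdot E|\Psi(U + \tau Z; r)|^2.$$

Now introducing $\alpha = \sqrt{1+\tau^2}/(1+r)$, where $r$ is some fixed positive scalar kept the same in all the coming displays,

$$E|\Psi_\lambda(\sqrt{1+\tau^2} \cdot Z; r)|^2 = (\alpha r)^2 \cdot E\,\psi_{\lambda/\alpha}^2(Z) = (\alpha r)^2 A(\lambda/\alpha, \Phi),$$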
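and

$$E|\Psi_\lambda(U + \tau Z; r)|^2 = \left(\frac{r}{1+r}\right)^2 \cdot E\,\psi_{\lambda(1+r)}^2(U + \tau Z) = \left(\frac{r}{1+r}\right)^2 A(\lambda \cdot (1+r), H \star \Phi_\tau).$$

Similarly,

$$E\,\Psi'(W + \tau Z; r) = (1-\varepsilon)\,E\,\Psi'(\sqrt{1+\tau^2} \cdot Z; r) + \varepsilon \cdot E\,\Psi'(U + \tau Z; r);$$

and

$$E\,\Psi'_\lambda(\sqrt{1+\tau^2} \cdot Z; r) = \frac{r}{1+r} \cdot E\,\psi'_{\lambda(1+r)}(\sqrt{1+\tau^2} \cdot Z).$$

But $E\,\psi'_{\lambda(1+r)}(\sqrt{1+\tau^2} \cdot Z) = E\,\psi'_{\lambda/\alpha}(Z) = B(\lambda/\alpha, \Phi)$; so

$$E\,\Psi'_\lambda(\sqrt{1+\tau^2} \cdot Z; r) = \frac{r}{1+r} \cdot B(\lambda/\alpha, \Phi).$$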

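We have the upper bound $A(\lambda \cdot (1+r), H \star \Phi_\tau) \le (\lambda \cdot (1+r))^2$ because $\|\psi_\kappa\|_\infty = \kappa$, and the lower bound $B(\lambda \cdot (1+r), H \star \Phi_\tau) \ge 0$ because $\psi'_\kappa \ge 0$. Moreover, both bounds are tight, as can be seen by choosing the point mass $H = \delta_\mu$ and letting $\mu \to \infty$. Combining all the above, and using the floating threshold $\lambda = \kappa\sqrt{1+\tau^2}$ so that $\lambda/\alpha = \kappa \cdot (1+r)$, we obtain the following.

Lemma 3.10. With $r > 0$, $\alpha = \sqrt{1+\tau^2}/(1+r)$, and $F = (1-\varepsilon)\Phi + \varepsilon H$,

$$\mathcal{A}(\tau^2, r; m, \kappa, F) = (1-\varepsilon)(\alpha r)^2 A(\kappa \cdot (1+r), \Phi) + \varepsilon \cdot \left(\frac{r}{1+r}\right)^2 A(\lambda \cdot (1+r), H \star \Phi_\tau),$$

$$\mathcal{A}(\tau^2, r; m, \kappa, F) \le (\alpha r)^2 \cdot \bar A(\kappa \cdot (1+r), \varepsilon),$$

$$\mathcal{A}(\tau^2, r; m, \kappa, F) \to (\alpha r)^2 \cdot \bar A(\kappa \cdot (1+r), \varepsilon), \qquad H = \delta_\mu, \ \mu \to \infty.$$

Lemma 3.11. With $\mathcal{B} = E\,\Psi'(W + \tau Z; r)$ and $W \sim F = (1-\varepsilon)\Phi + \varepsilon H$,

$$\mathcal{B}(\tau^2, r; m, \kappa, F) = \frac{r}{1+r} \cdot \left((1-\varepsilon) \cdot B(\kappa \cdot (1+r), \Phi) + \varepsilon \cdot E\,\psi'_{\lambda(1+r)}(U + \tau Z)\right),$$

$$\mathcal{B}(\tau^2, r; m, \kappa, F) \ge \frac{r}{1+r} \cdot \underline{B}(\kappa \cdot (1+r), \varepsilon),$$

$$\mathcal{B}(\tau^2, r; m, \kappa, F) \to \frac{r}{1+r} \cdot \underline{B}(\kappa \cdot (1+r), \varepsilon), \qquad H = \delta_\mu, \ \mu \to \infty.$$

The proof of Lemma 3.9, in the Appendix, combines the last two lemmas to obtain the equivalence of LFSE and double-bar evolution.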
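The two pointwise identities underlying this subsection, $r\,\psi_\lambda(x/(1+r)) = \frac{r}{1+r}\,\psi_{\lambda(1+r)}(x)$ and $\psi_\kappa(cx) = c\,\psi_{\kappa/c}(x)$ for $c > 0$, can be checked directly on a grid. The sketch below is only a sanity check with hypothetical function names; it takes the Huber score to be the clipping function $\psi_\lambda(x) = \max(-\lambda, \min(\lambda, x))$.

```python
import math


def huber_psi(x, lam):
    """Huber score: the identity clipped to the interval [-lam, lam]."""
    return max(-lam, min(lam, x))


def regularized_score(x, r, lam):
    """Psi_lambda(x; r) written as r * psi_lambda(x / (1 + r))."""
    return r * huber_psi(x / (1.0 + r), lam)


def regularized_score_rescaled(x, r, lam):
    """The same quantity written as r/(1+r) * psi_{lambda(1+r)}(x)."""
    return r / (1.0 + r) * huber_psi(x, lam * (1.0 + r))


def identities_hold(r, lam, c, grid):
    """Check both rescaling identities at every grid point."""
    for x in grid:
        same_score = math.isclose(regularized_score(x, r, lam),
                                  regularized_score_rescaled(x, r, lam),
                                  abs_tol=1e-12)
        same_scale = math.isclose(huber_psi(c * x, lam),
                                  c * huber_psi(x, lam / c),
                                  abs_tol=1e-12)
        if not (same_score and same_scale):
            return False
    return True
```

Taking expectations of the squared identities gives the moment relations used in Lemmas 3.10 and 3.11.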

4 Minimax Asymptotic Variance of Floating Threshold SE 

4.1 Minimax Formal Variance 
Definition 4.1. Define the formal variance

$$V_m(\kappa, F) = m \cdot \tau_\infty^2(m, \kappa, F),$$

where $\tau_\infty^2(m, \kappa, F)$ denotes the fixed point of the associated floating-threshold state evolution.

Define the minimax formal variance to be

$$V_m^*(\varepsilon) = \inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F).$$

The minimax problem identifies a distinguished choice of the capping parameter, offering the best guarantee applicable across all $F \in \mathcal{F}_\varepsilon$. Here is the solution:

Lemma 4.2. The mapping $\kappa \mapsto \bar r(\kappa; m, \varepsilon)$ is continuous and strictly monotone decreasing. For each $\bar\kappa > 0$, the equation

$$\bar\kappa = \kappa \cdot (1 + \bar r(\kappa))$$

has a unique solution $\kappa = \kappa(\bar\kappa)$.

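Theorem 4.3. Let $\bar\kappa^*(\varepsilon)$ denote Huber's minimax capping parameter in the scalar estimation problem [Hub64]. Let $\kappa(\cdot)$ denote the re-calibrated function defined by Lemma 4.2. Define the re-calibrated parameter

$$\kappa^*(\varepsilon) = \kappa(\bar\kappa^*(\varepsilon)).$$

Suppose that $m \cdot I(F^*) > 1$; then every instance of floating-threshold state evolution having parameters $(m, \tau_0, \kappa^*(\varepsilon), F)$ with proper $F \in \mathcal{F}_\varepsilon$ has a fixed point at $\tau_\infty^2 = \tau_\infty^2(m, \kappa^*(\varepsilon), F)$ obeying

$$\tau_\infty^2 \le \bar\tau_\infty^2(m, \varepsilon, \kappa^*(\varepsilon)) = \frac{v^*(\varepsilon)/m}{1 - v^*(\varepsilon)/m}.$$

More specifically, we have the saddlepoint relation:

$$\inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) = V_m(\kappa^*(\varepsilon), F_\varepsilon) = \sup_{F \in \mathcal{F}_\varepsilon} \inf_\kappa V_m(\kappa, F),$$

with saddle point at $(\kappa^*(\varepsilon), F_\varepsilon)$, and where the minimax value $V_m^*(\varepsilon) = V_m(\kappa^*(\varepsilon), F_\varepsilon)$ obeys:

$$V_m^*(\varepsilon) = \frac{v^*(\varepsilon)}{1 - v^*(\varepsilon)/m} = \frac{1}{I(F^*) - 1/m}.$$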
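The minimax value in Theorem 4.3 depends on $\varepsilon$ only through Huber's scalar minimax variance $v^*(\varepsilon) = \inf_\kappa \bar V(\kappa,\varepsilon)$. The sketch below computes $v^*(\varepsilon)$ by a ternary search and then applies $V_m^* = v^*/(1 - v^*/m)$. It is an illustration under stated assumptions: the function names are hypothetical, the search interval $[10^{-4}, 10]$ is an arbitrary choice, and the search relies on $\bar V(\cdot,\varepsilon)$ having a single minimum.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def V_upper(kappa, eps):
    """Worst-case scalar variance: upper envelope A over lower envelope B^2."""
    gauss = ((2.0 * normal_cdf(kappa) - 1.0) - 2.0 * kappa * normal_pdf(kappa)
             + 2.0 * kappa ** 2 * (1.0 - normal_cdf(kappa)))
    a_up = (1.0 - eps) * gauss + eps * kappa ** 2
    b_low = (1.0 - eps) * (2.0 * normal_cdf(kappa) - 1.0)
    return a_up / b_low ** 2


def huber_minimax(eps, lo=1e-4, hi=10.0, iters=200):
    """Ternary search for the capping value minimising V_upper(., eps)."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if V_upper(a, eps) < V_upper(b, eps):
            hi = b
        else:
            lo = a
    kappa_star = 0.5 * (lo + hi)
    return kappa_star, V_upper(kappa_star, eps)


def minimax_formal_variance(m, eps):
    """v*/(1 - v*/m) in the bounded phase, infinity otherwise."""
    _, v_star = huber_minimax(eps)
    if v_star >= m:
        return math.inf
    return v_star / (1.0 - v_star / m)
```

For instance, `minimax_formal_variance(5.0, 0.05)` evaluates the theorem's formula at the parameter values used in Figure 4.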

Figure 1 presents a diagram showing contours of $V_m^*(\varepsilon)$. The diagram employs the unit square $\{(\varepsilon, 1/m) : 0 < \varepsilon, 1/m < 1\}$, where the x-axis shows the contamination fraction $\varepsilon$ and the y-axis shows $1/m$. Only the part of the diagram where $1/m < I(F^*)$ is populated with contours. The reader can see how the asymptotic variance ‘blows up’ as $1/m$ approaches $I(F^*)$.
Figure 5 shows contours of the minimax capping parameter $\kappa^*(\varepsilon; m)$. The reader can see how the capping parameter shrinks to zero as $1/m$ approaches $I(F^*)$.


4.2 State Evolution in the Unbounded Phase 

Figure 1 has a ‘bounded’ phase, where the formal variance is bounded across all contaminating distributions, and a complementary phase that we have not yet described. It seems that the formal variance must be unbounded in this phase, since the phase consists of cases with smaller $m$ than the bounded ones, and hence of ‘harder’ cases. Validating this intuition, we have:

Corollary 4.4. Suppose that $m \cdot I(F^*) < 1$; then for each $\tau < \infty$, and each $\kappa > 0$, some instance of proper state evolution with parameters $(m, \tau_0, \kappa, F)$ and proper $F \in \mathcal{F}_\varepsilon$ has a unique fixed point at $\tau_\infty^2 = \tau_\infty^2(m, \kappa, F)$ obeying

$$\tau_\infty > \tau.$$

Figure 5: Minimax $\kappa^*(\varepsilon; m)$. Each pair $(\varepsilon, m)$ is represented by the point $x = \varepsilon$ and $y = 1/m$. Contours of the minimax capping parameter $\kappa^*(\varepsilon; m)$ are depicted in the region below the dashdot curve at $1/m = I(F^*)$.


Goings-on in the unbounded phase are documented in Figure 6. In the unbounded phase, every LFSE map $\bar{\mathcal{T}}$ has no fixed point, whatever the parameter $\kappa$. Proper state evolutions still have unique stable fixed points, but there is no upper bound on their size. Hence the worst-case fixed point is infinite.

This is an instance of what Donoho and Huber [DH83] called breakdown of asymptotic variance. Breakdown occurs, in the $(\varepsilon, 1/m)$ phase diagram, wherever $m\,I(F^*) < 1$, and the breakdown point is $m\,I(F^*) = 1$, the dashdot curve in our figures.

Note that as $m \to \infty$, we converge to the classical case, where the asymptotic variance of (M)-estimates does not break down. In the high-dimensional case $n/p \to m \in (1, \infty)$, the asymptotic variance does break down.
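The breakdown curve $m\,I(F^*) = 1$ can be traced numerically. The sketch below finds, for a given $m$, the contamination fraction at which $v^*(\varepsilon) = m$. It is illustrative only and rests on labeled assumptions: the function names are hypothetical; it uses the identity $v^*(\varepsilon) = 1/I(F^*)$ from Theorem 4.3; and the bisection relies on $v^*(\varepsilon)$ being increasing in $\varepsilon$ and on $\bar V(\cdot,\varepsilon)$ being unimodal in the capping parameter.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def V_upper(kappa, eps):
    gauss = ((2.0 * normal_cdf(kappa) - 1.0) - 2.0 * kappa * normal_pdf(kappa)
             + 2.0 * kappa ** 2 * (1.0 - normal_cdf(kappa)))
    a_up = (1.0 - eps) * gauss + eps * kappa ** 2
    b_low = (1.0 - eps) * (2.0 * normal_cdf(kappa) - 1.0)
    return a_up / b_low ** 2


def v_star(eps, lo=1e-4, hi=10.0, iters=200):
    """Huber's minimax scalar variance, by ternary search over kappa."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if V_upper(a, eps) < V_upper(b, eps):
            hi = b
        else:
            lo = a
    return V_upper(0.5 * (lo + hi), eps)


def breakdown_eps(m, iters=60):
    """Contamination fraction at which v*(eps) reaches m (bisection)."""
    lo, hi = 1e-6, 0.999
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if v_star(mid) < m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Evaluating `breakdown_eps(m)` over a grid of $m$ values gives points $(\varepsilon, 1/m)$ on the boundary between the bounded and unbounded phases.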

5 Minimax Variance of the Huber (M)-estimates 

We now develop our main result about (M)-estimates. The analysis in the last section concerns floating-$\lambda$ state evolution, while Theorem 2.2 shows that fixed-$\lambda$ state evolution describes the asymptotic variance of the Huber (M)-estimate. We show how to bridge this difference.

5.1 Minimax Formal Variance 

Definition 5.1. Calibration Relation. Suppose the proper floating-threshold state evolution with parameters $(m, \tau_0, \kappa, F)$ has a unique fixed point $\tau_\infty^2$. We formally associate this to a Huber (M)-estimate in the linear model under asymptotic regime PL(m) with parameter $\lambda$ satisfying

$$\lambda = \kappa \cdot \sqrt{1 + \tau_\infty^2(m, \kappa, F)}.$$

We denote this correspondence by $\lambda = \lambda_\infty(m, \kappa, F)$ and the inverse correspondence by $\kappa = \kappa_\infty(m, \lambda, F)$.

Figure 6: SE in the unbounded phase. Examples of proper state evolutions with $\mu = 2, 5, 7.5, 10$, $\varepsilon = 0.05$, $m = 5$. The LFSE dynamical system has no fixed point. The proper SE's have fixed points, but the location of the fixed point is unbounded above.

Definition 5.2. The formal asymptotic variance of the Huber (M)-estimator under the PL(m) asymptotic framework is

$$V_m^{\circ}(\lambda, F) = m \cdot \tau_\infty^2(m, \kappa, F),$$

where $\tau_\infty^2(m, \kappa, F)$ denotes the fixed point of the floating-threshold state evolution with parameter $\kappa$ and where $\lambda = \lambda_\infty(m, \kappa, F)$.

Theorem 2.2 shows that this formula is rigorously correct: the Huber estimator with the specified parameter $\lambda$ indeed has, almost surely, an asymptotic variance, and it is equal to the formal asymptotic variance.

Lemma 5.3. Let $\bar V_m(\kappa, \varepsilon) = \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F)$ denote the worst-case formal variance, across the full $\varepsilon$-neighborhood, of the floating-threshold state evolution fixed point under capping parameter $\kappa$. Set

$$\kappa^+(m, \varepsilon) = \sup\{\kappa : \bar V_m(\kappa, \varepsilon) < \infty\};$$

there is $m_0(\varepsilon) \in (1, \infty)$ so that, for $m > m_0(\varepsilon)$, we have $\bar V_m(\kappa, \varepsilon) < \infty$ throughout $(0, \kappa^+(m, \varepsilon))$. Define

$$\bar\lambda(\kappa; m, \varepsilon) = \sup_{F \in \mathcal{F}_\varepsilon} \lambda_\infty(m, \kappa, F).$$

For $m > m_0(\varepsilon)$, the mapping

$$\kappa \mapsto \bar\lambda(\kappa; m, \varepsilon)$$

is strictly increasing for $0 < \kappa < \kappa^+(m, \varepsilon)$.

Figure 7 displays $\bar\lambda(\kappa)$ for a variety of choices of $\varepsilon$, $m$; the monotonicity is evident. Numerics show that we may take $m_0 = \pi/2$; however, our proof only attempts to show that some sufficiently large $m_0$ will work.
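The monotonicity in Lemma 5.3 can be probed numerically. The sketch below evaluates $\bar\lambda$ on a grid, using the closed form $\bar\lambda = \kappa/\sqrt{1 - \bar V(\bar\kappa(\kappa),\varepsilon)/m}$ that the appendix proof of Lemma 5.3 derives from the LFSE fixed point. It is an illustration, not the authors' numerics: the function names and the grid are assumptions, and $\bar r$ is obtained by bisection on equation (24).

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def B_lower(kappa, eps):
    return (1.0 - eps) * (2.0 * normal_cdf(kappa) - 1.0)


def V_upper(kappa, eps):
    gauss = ((2.0 * normal_cdf(kappa) - 1.0) - 2.0 * kappa * normal_pdf(kappa)
             + 2.0 * kappa ** 2 * (1.0 - normal_cdf(kappa)))
    return ((1.0 - eps) * gauss + eps * kappa ** 2) / B_lower(kappa, eps) ** 2


def solve_r_bar(m, eps, kappa):
    """Bisection for the root of r/(1+r) * B_lower(kappa(1+r), eps) = 1/m."""
    def g(r):
        return r / (1.0 + r) * B_lower(kappa * (1.0 + r), eps) - 1.0 / m
    lo, hi = 0.0, 1.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lambda_bar(kappa, m, eps):
    """Worst-case calibrated threshold; infinity outside the bounded range."""
    kappa_bar = kappa * (1.0 + solve_r_bar(m, eps, kappa))
    v = V_upper(kappa_bar, eps)
    if v >= m:
        return math.inf
    return kappa / math.sqrt(1.0 - v / m)


def is_strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))
```

For example, `[lambda_bar(0.05 * i, 5.0, 0.05) for i in range(1, 61)]` tabulates the curve for $m = 5$ and $\varepsilon = 0.05$, which can then be passed to `is_strictly_increasing`.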




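[Figure 7 panels: $m = 5.00$ and $m = 10.00$]

Figure 7: Monotonicity of $\kappa \mapsto \bar\lambda(\kappa)$. Each subplot depicts $\bar\lambda(\kappa; m, \varepsilon)$ as a function of $\kappa$, for $\varepsilon \in \{0.01, 0.02, 0.05, 0.10\}$, at one particular $m$. Evidently, as $m \to \infty$, $\bar\lambda \to \kappa$.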


The monotonicity condition on $\bar\lambda$ ensures that the least-favorable contamination for the Huber (M)-estimator is achieved by the improper distribution $F_\varepsilon$.

Theorem 5.4. Evaluation of Minimax Asymptotic Variance of Huber (M)-estimator. If the mapping $\kappa \mapsto \bar\lambda(\kappa; m, \varepsilon)$ is strictly increasing for $0 < \kappa < \kappa^+$, we have

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon} V_m^{\circ}(\lambda, F) = \inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F),$$

where the minimax on the left concerns the formal variance of Huber (M)-estimates parametrized by $\lambda$, and that on the right concerns the formal variance of floating-threshold state evolutions parametrized by $\kappa$. The minimax tuning of the Huber (M)-estimator is achieved by the tuning parameter

$$\lambda^*(\varepsilon) = \bar\lambda(m, \kappa^*(\varepsilon), \varepsilon).$$
It follows of course that we have the formula

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon} V_m^{\circ}(\lambda, F) = \frac{1}{I(F^*) - 1/m},$$

which agrees in the limit $m \to \infty$ with Huber's classical formula for the scalar location problem:

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon} V(\psi_\lambda, F) = \frac{1}{I(F^*)}.$$

Figure 2 shows contours of the minimax thresholding parameter $\lambda^*(\varepsilon)$. The reader can see how this parameter shrinks to zero as $1/m$ approaches $I(F^*)$. While the story is much the same as for the $\kappa$ parameter in Figure 5, the $\lambda$ parameter is the one relevant to practice, because the $\kappa$ parameter is a theoretical construct while the corresponding $\lambda$ parameter can actually be used to specify the desired Huber estimator in statistical software packages.

Since the formal variance $V_m^{\circ}(\lambda, F)$ has the saddlepoint property, Theorem 2.2 shows that the rigorous asymptotic variance $\mathrm{AVar} = \mathrm{AVar}(\hat\theta^\lambda, F)$ (say) has it as well.

Definition 5.5. Let $\mathcal{F}_\varepsilon^2$ denote the subset of distributions in $\mathcal{F}_\varepsilon$ with finite variance: $\mu_2(F) = \int w^2\,dF(w) < \infty$.

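Corollary 5.6. Fix $\varepsilon > 0$ and $m > m_0(\varepsilon)$. We are in the asymptotic regime PL(m).

• Suppose that $V_m^*(\varepsilon)$ is finite. Consider the formally minimax parameter $\lambda = \lambda^*(\varepsilon)$; let $\hat\theta^*$ denote a corresponding solution of the Huber (M)-equation with that $\lambda$. For every error distribution $F \in \mathcal{F}_\varepsilon^2$, we have

$$\mathrm{AVar}(\hat\theta^*, F) \le V_m^*(\varepsilon).$$

For every $\lambda \ne \lambda^*(\varepsilon)$ there is a proper $\varepsilon$-contaminated normal error distribution $F \in \mathcal{F}_\varepsilon^2$ so that the Huber estimator $\hat\theta^\lambda$ obeys

$$\mathrm{AVar}(\hat\theta^\lambda, F) > V_m^*(\varepsilon).$$

Consequently,

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon^2} \mathrm{AVar}(\hat\theta^\lambda, F) = \mathrm{AVar}(\hat\theta^*, F_\varepsilon) = V_m^*(\varepsilon).$$

• Suppose that $V_m^*(\varepsilon)$ is infinite. For every $\lambda > 0$ and each $V > 0$, there is a proper $\varepsilon$-contaminated normal error distribution $F \in \mathcal{F}_\varepsilon^2$ with

$$\mathrm{AVar}(\hat\theta^\lambda, F) > V.$$

Consequently,

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon^2} \mathrm{AVar}(\hat\theta^\lambda, F) = +\infty = V_m^*(\varepsilon).$$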
6 Discussion

Under the high-dimensional PL(m) asymptotic, as shown in [BBEKY13], the maximum likelihood estimator is no longer an efficient estimator. It follows that the Huber estimator is no longer asymptotically minimax among all (M)-estimators. Hence the asymptotic minimax in (6) would better be called the asymptotic minimax among Huber estimates. The degree of suboptimality can be controlled explicitly. By [DM13, Corollary 3.7], the asymptotic variance under PL(m) obeys the following inequality, which is strictly stronger than the Cramér-Rao bound when $1 < m < \infty$:

$$V_m(\psi, F) \ge \frac{1}{1 - 1/m} \cdot \frac{1}{I(F)},$$

and so the minimax asymptotic variance obeys:

$$\min_\psi \max_{F \in \mathcal{F}_\varepsilon} V_m(\psi, F) \ge \frac{1}{1 - 1/m} \cdot \frac{1}{I(F^*)}. \tag{28}$$

It follows that, provided $m\,I(F^*) > 1$,

$$\text{minimax Huber asymptotic variance} \le K \cdot \text{minimax asymptotic variance},$$

where, dividing the Huber minimax value $1/(I(F^*) - 1/m)$ by the lower bound in (28),

$$K = K(m, \varepsilon) = \frac{1 - 1/m}{1 - I(F^*)^{-1}/m}.$$

One sees directly that the suboptimality of the Huber estimator is well controlled provided that $I(F^*)$ is close to one; i.e., in the regime where $\varepsilon$ is small enough (though this is $m$-dependent). Of course, in the regime $I(F^*)\,m < 1$, some other estimators could be dramatically more robust.
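The size of the suboptimality factor can be checked by dividing the Huber minimax variance $1/(I(F^*) - 1/m)$ by the lower bound $\frac{1}{1-1/m}\cdot\frac{1}{I(F^*)}$ displayed in (28). The sketch below does exactly that division; the function name is hypothetical, and the Fisher information is passed in as a number rather than computed.

```python
def suboptimality_ratio(m, fisher_info):
    """Huber minimax variance divided by the lower bound in (28).

    Requires m * fisher_info > 1 (bounded phase) and m > 1.
    """
    if m <= 1.0 or m * fisher_info <= 1.0:
        raise ValueError("need m > 1 and m * I(F*) > 1")
    huber_minimax = 1.0 / (fisher_info - 1.0 / m)
    lower_bound = 1.0 / ((1.0 - 1.0 / m) * fisher_info)
    return huber_minimax / lower_bound
```

With `fisher_info = 1.0` (no contamination) the ratio is exactly one for every $m > 1$, and it grows as $m\,I(F^*)$ decreases toward one.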
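Appendix: Proofs

Proof of Lemma 3.2

The desired relations are true for iteration $t = 0$ by assumption (note that no assertion about the sequence $(r_t)_{t \ge 1}$ is made at stage $t = 0$, only about $\tau_0$).

Suppose that we have proved the desired relations up to iteration $t - 1$ and we now must show that they hold for iteration $t$.

We observe that $F_{\varepsilon,\sigma}$ is stochastically more spread than any proper distribution in $\mathcal{F}_{\varepsilon,\sigma}$, that is, any distribution with all its mass on the reals $\mathbb{R}$ rather than the extended reals $\mathbb{R} \cup \{\pm\infty\}$. Hence for every function $\xi(x)$ monotone increasing in $|x|$,

$$\sup\{E\,\xi(X) : X \sim F \in \mathcal{F}_{\varepsilon,\sigma}\} = E\,\xi(X_{\varepsilon,\sigma}).$$

In this sense $X_{\varepsilon,\sigma}$ is extremal among contaminated normals. Moreover, we note that for $\bar\sigma > \sigma$, $F_{\varepsilon,\bar\sigma}$ is more spread than $F_{\varepsilon,\sigma}$. Hence, again for $\xi$ monotone increasing in $|x|$,

$$E\,\xi(X_{\varepsilon,\sigma}) \le E\,\xi(X_{\varepsilon,\bar\sigma}).$$

Applying Lemma 3.3, Claim 2,

$$\inf\{E_F\,\Psi'_\lambda(X; r) : X \sim F \in \mathcal{F}_{\varepsilon,\sigma}\} = E\,\Psi'_\lambda(X_{\varepsilon,\sigma}; r), \qquad \forall\, r, \lambda > 0. \tag{29}$$

So in particular, for $X \sim F \in \mathcal{F}_{\varepsilon,\sigma_{t-1}}$ we have $E\,\Psi'_{\lambda_t}(X_{\varepsilon,\sigma_{t-1}}; r_t) \le E\,\Psi'_{\lambda_t}(X; r_t)$. Hence we must have

$$\frac{1}{m} = E\,\Psi'_{\lambda_t}(X; r_t) \ge E\,\Psi'_{\lambda_t}(X_{\varepsilon,\sigma_{t-1}}; r_t) = E\,\Psi'_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; r_t).$$

The first step is just the definition of $r_t$, the second step used (29), and the third step used (16). Now $r \mapsto r/(1+r)$ is increasing in $r$, while $r \mapsto P\{|X_{\varepsilon,\bar\sigma_{t-1}}| < \kappa \cdot \bar\sigma_{t-1} \cdot (1+r)\}$ is also monotone increasing in $r$. Hence the product $E\,\Psi'_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; r)$ is monotone increasing in $r$; so in order to satisfy the definition of $\bar r_t$,

$$\frac{1}{m} = E\,\Psi'_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; \bar r_t),$$

we must have

$$\bar r_t \ge r_t.$$

Now turn to the dominance relation concerning $\tau^2$:

$$\sup\{E\,\Psi^2_{\lambda_t}(X; r_t) : X \sim F \in \mathcal{F}_{\varepsilon,\sigma_{t-1}}\} = E\,\Psi^2_{\lambda_t}(X_{\varepsilon,\sigma_{t-1}}; r_t) \le E\,\Psi^2_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; r_t) \le E\,\Psi^2_{\bar\lambda_t}(X_{\varepsilon,\bar\sigma_{t-1}}; \bar r_t),$$

where in the first inequality we substituted (17) and in the second inequality we substituted $r_t \mapsto \bar r_t$ by Lemma 3.3, Claim 3. We conclude that

$$\tau_t \le \bar\tau_t,$$

which completes iteration $t$ of the claimed result and sets up the assumptions for the next iteration.

□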


Proof of Lemma 3.3

To prove the $i$-th claim, for $i = 1, \ldots, 4$, combine the formula $\Psi_\lambda(x; r) = r\,\psi_\lambda(x/(1+r))$ with the corresponding numbered observation:

1. From $|\psi_\lambda(x)| = \min(|x|, \lambda)$, $|x| \mapsto |\psi_\lambda(x)|$ is nondecreasing.

2. From $\psi'_\lambda(x) = 1_{\{|x| < \lambda\}}$, $|x| \mapsto \psi'_\lambda(x)$ is nonincreasing.

3. $r \mapsto r/(1+r)$ is monotone increasing.

4. $\lambda \mapsto \min(|x|, \lambda)$ is nondecreasing.

□
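Proof of Lemma 3.4

These are simple scaling invariances, combined with $\Psi_\lambda(x; r) = \frac{r}{1+r}\,\psi_{\lambda(1+r)}(x)$.

The first, (16), says simply that for any $0 < \sigma < \bar\sigma$,

$$E\,\psi'_{\kappa\sigma}(X_{\varepsilon,\sigma}) = P\{|X_{\varepsilon,\sigma}| < \kappa \cdot \sigma\}$$
$$= (1-\varepsilon)\,\Phi_\sigma(-\kappa\sigma, \kappa\sigma)$$
$$= (1-\varepsilon)\,\Phi(-\kappa, \kappa)$$
$$= (1-\varepsilon)\,\Phi_{\bar\sigma}(-\kappa\bar\sigma, \kappa\bar\sigma)$$
$$= P\{|X_{\varepsilon,\bar\sigma}| < \kappa \cdot \bar\sigma\}$$
$$= E\,\psi'_{\kappa\bar\sigma}(X_{\varepsilon,\bar\sigma}),$$

where $\Phi_\sigma(a, b)$ denotes the $N(0, \sigma^2)$ probability of the interval $(a, b)$; the improper mass at infinity contributes nothing to these probabilities.

The second, (17), combines two invariances. If $Z \sim N(0,1)$, then

$$E\,\psi_{\kappa\sigma}^2(\sigma Z) = \sigma^2 \cdot E\,\psi_\kappa^2(Z);$$

while, if $U \sim H_\infty$ is the degenerate improper random variable supported at $\infty$,

$$E\,\psi_{\kappa\sigma}^2(U) = \kappa^2\sigma^2.$$

Hence

$$E\,\psi_{\kappa\sigma}^2(X_{\varepsilon,\sigma}) = \sigma^2 \cdot \left((1-\varepsilon)\,E\,\psi_\kappa^2(Z) + \varepsilon\kappa^2\right)$$

is proportional to $\sigma^2$. Applying this both to $\sigma = \sigma_{t-1}$ and to $\sigma = \bar\sigma_{t-1}$ gives (17).

□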


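Proof of Lemma 3.9

Note that $r \mapsto r/(1+r)$ and $r \mapsto B(\kappa \cdot (1+r), F)$ are each strictly monotone increasing in $r > 0$. Moreover,

$$\inf_{F \in \mathcal{F}_\varepsilon} B(\kappa \cdot (1+r), F) = \underline{B}(\kappa \cdot (1+r), \varepsilon).$$

Set $R(b) = 1/(mb - 1)$; this is monotone decreasing in $b > 1/m$. Then $r = R(B)$, and this relationship is monotone decreasing in $B > 1/m$. Hence

$$\sup_{F \in \mathcal{F}_\varepsilon} R(\mathcal{B}(\tau^2, r; m, \kappa, F)) = R(\underline{B}(\kappa \cdot (1+r), \varepsilon)).$$

Also note that $\bar r = R(\underline{B}(\kappa \cdot (1+\bar r), \varepsilon))$. It follows that for each $\eta > 0$, for some $r > \bar r - \eta$ we can find $F \in \mathcal{F}_\varepsilon$ satisfying

$$r = R(\mathcal{B}(\tau^2, r; m, \kappa, F)).$$

Hence,

$$\sup_{F \in \mathcal{F}_\varepsilon} \mathcal{R}(\tau; m, \kappa, F) \ge \bar r.$$

Since $r \mapsto \underline{B}(\kappa(1+r), \varepsilon)$ is strictly monotone increasing, we conclude that for every $r > \bar r$ we must have

$$r > \bar r = R(\underline{B}(\kappa \cdot (1+\bar r), \varepsilon)) > R(\underline{B}(\kappa \cdot (1+r), \varepsilon))$$

and hence, for such $r$,

$$r > \sup_{F \in \mathcal{F}_\varepsilon} R(\mathcal{B}(\tau^2, r; m, \kappa, F)),$$

implying that

$$r > \sup_{F \in \mathcal{F}_\varepsilon} \mathcal{R}(\tau; m, \kappa, F),$$

and so also

$$\bar r \ge \sup_{F \in \mathcal{F}_\varepsilon} \mathcal{R}(\tau; m, \kappa, F),$$

which proves (26).

We turn to Eq. (27). Set $\alpha = \sqrt{1+\tau^2}/(1+\bar r)$. We have

$$\sup_{F \in \mathcal{F}_\varepsilon} \mathcal{A}(\tau^2, \mathcal{R}(\tau; m, \kappa, F); m, \kappa, F) = \sup_{F \in \mathcal{F}_\varepsilon} \mathcal{A}(\tau^2, \bar r; m, \kappa, F)$$
$$= (\alpha \bar r)^2 \cdot \bar A(\kappa \cdot (1+\bar r), \varepsilon)$$
$$= \left(\frac{\bar r}{1+\bar r}\right)^2 (1+\tau^2)\,\bar A(\bar\kappa, \varepsilon)$$
$$= \frac{1+\tau^2}{m^2} \cdot \frac{\bar A(\bar\kappa, \varepsilon)}{\underline{B}^2(\bar\kappa, \varepsilon)}$$
$$= (1+\tau^2) \cdot \bar V(\bar\kappa, \varepsilon)/m^2$$
$$= \bar{\bar{\mathcal{A}}}(\tau^2; m, \varepsilon, \kappa).$$

In the first step we used monotonicity of $r \mapsto \mathcal{A}(\tau^2, r; m, \kappa, F)$, and in the second step we used Lemma 3.10; the fourth step uses (24). In each step the inequality is clear, while equality is demonstrated by choosing a sequence of contamination CDFs $H = \delta_\mu$, $\mu \to \infty$.

□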


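Proof of Lemma 4.2

For fixed $(\varepsilon, m)$, consider the relationship between $(\bar r, \bar\kappa)$ implied by

$$\frac{\bar r}{1+\bar r}\,\underline{B}(\varepsilon, \bar\kappa) = \frac{1}{m}.$$

Note that $\underline{B}(\varepsilon, \bar\kappa) = (1-\varepsilon)(2\Phi(\bar\kappa) - 1)$, where $\Phi(x)$ is the standard normal CDF, which is a bijection between $(-\infty, \infty)$ and $(0, 1)$. One can check that, for fixed $(\varepsilon, m)$, $(\bar r, \bar\kappa)$ are in one-one correspondence by the functions

$$\bar r(\bar\kappa) = \frac{1/m}{(1-\varepsilon)(2\Phi(\bar\kappa) - 1) - 1/m}$$

and

$$\bar\kappa(\bar r) = \Phi^{-1}\left(\left(1 + \frac{1 + 1/\bar r}{m(1-\varepsilon)}\right)\Big/2\right),$$

acting as bijections $\bar r \leftrightarrow \bar\kappa$ between domains $(0, \infty)$ and $(0, R^*)$, where $R^*(\varepsilon, m) = \Phi^{-1}\left(\left(1 + \frac{1}{m(1-\varepsilon)}\right)/2\right)$.

Defining

$$\kappa(\bar\kappa) = \bar\kappa/(1 + \bar r(\bar\kappa)),$$

the pair $(\kappa(\bar\kappa), \bar r(\bar\kappa))$ will obey the relation

$$\frac{\bar r}{1+\bar r}\,\underline{B}(\varepsilon, \kappa(1+\bar r)) = \frac{1}{m}.$$

We obtain the explicit expression

$$\kappa(\bar\kappa) = \frac{\bar\kappa}{1 + \bar r(\bar\kappa)},$$

showing directly that $\kappa$ is uniquely defined in terms of $\bar\kappa$, for given $(\varepsilon, m)$.

□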
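The correspondence used in this proof can be checked by solving the defining relation $\frac{r}{1+r}(1-\varepsilon)(2\Phi(\bar\kappa)-1) = \frac{1}{m}$ once for $r$ and once for $\bar\kappa$, then composing the two maps. The sketch below is illustrative only: the function names are hypothetical, both formulas are derived here directly from that relation, and the maps are only defined where $(1-\varepsilon)(2\Phi(\bar\kappa)-1) > 1/m$.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def B_lower(kappa_bar, eps):
    """Lower envelope (1 - eps) * (2 * Phi(kappa_bar) - 1)."""
    return (1.0 - eps) * (2.0 * _STD_NORMAL.cdf(kappa_bar) - 1.0)


def r_of_kappa_bar(kappa_bar, m, eps):
    """Solve r/(1+r) * B_lower = 1/m for r."""
    b = B_lower(kappa_bar, eps)
    if b <= 1.0 / m:
        raise ValueError("need B_lower(kappa_bar, eps) > 1/m")
    return 1.0 / (m * b - 1.0)


def kappa_bar_of_r(r, m, eps):
    """Solve the same relation for kappa_bar, given r."""
    p = 0.5 * (1.0 + (1.0 + 1.0 / r) / (m * (1.0 - eps)))
    if not 0.0 < p < 1.0:
        raise ValueError("no kappa_bar corresponds to this r")
    return _STD_NORMAL.inv_cdf(p)


def kappa_of_kappa_bar(kappa_bar, m, eps):
    """Re-calibrated capping parameter kappa = kappa_bar / (1 + r)."""
    return kappa_bar / (1.0 + r_of_kappa_bar(kappa_bar, m, eps))
```

A round trip such as `kappa_bar_of_r(r_of_kappa_bar(1.345, 5.0, 0.05), 5.0, 0.05)` should return the starting value up to rounding.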


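Proof of Theorem 4.3

By Lemma 3.9 the variance map $\bar{\mathcal{T}}$ is the pointwise supremum of all variance maps of proper floating-threshold state evolutions (FTSE) with $F \in \mathcal{F}_\varepsilon$. Hence, no proper FTSE can have a larger fixed point; i.e.

$$\tau_\infty^2(m, \kappa, F) \le \bar\tau_\infty^2(m, \kappa, \varepsilon), \qquad \forall F \in \mathcal{F}_\varepsilon.$$

From $V_m(\kappa, F) = m \cdot \tau_\infty^2(m, \kappa, F)$, we have

$$\sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) = m \cdot \bar\tau_\infty^2(m, \kappa, \varepsilon),$$

and so, if $\bar\tau_\infty^2(m, \kappa, \varepsilon) < \infty$, implying $\bar V(\bar\kappa(\kappa), \varepsilon) < m$,

$$\sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) = \frac{\bar V(\bar\kappa(\kappa), \varepsilon)}{1 - \bar V(\bar\kappa(\kappa), \varepsilon)/m}.$$

Setting $\mathcal{K}_0 = \{\kappa : \bar V(\bar\kappa(\kappa), \varepsilon) < m\}$,

$$\inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) = \inf_{\kappa \in \mathcal{K}_0} \frac{\bar V(\bar\kappa(\kappa), \varepsilon)}{1 - \bar V(\bar\kappa(\kappa), \varepsilon)/m}.$$

Now by construction,

$$\bar V(\bar\kappa(\kappa^*(\varepsilon)), \varepsilon) = \bar V(\bar\kappa^*(\varepsilon), \varepsilon) = v^*(\varepsilon);$$

and moreover for $\kappa \ne \kappa^*(\varepsilon)$, $\bar\kappa(\kappa) \ne \bar\kappa^*(\varepsilon)$; so

$$\bar V(\bar\kappa(\kappa), \varepsilon) > \bar V(\bar\kappa^*(\varepsilon), \varepsilon) = v^*(\varepsilon).$$

Now $v \mapsto v/(1 - v/m)$ is monotone increasing on $\{v : v < m\}$. Consequently, if $v^*(\varepsilon) < m$,

$$\inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) = \frac{v^*(\varepsilon)}{1 - v^*(\varepsilon)/m}. \tag{30}$$

By hypothesis $v^*(\varepsilon)/m = 1/(m\,I(F^*)) < 1$, and so this formula indeed holds.

Now note that automatically

$$\inf_\kappa \sup_{F \in \mathcal{F}_\varepsilon} V_m(\kappa, F) \ge \sup_{F \in \mathcal{F}_\varepsilon} \inf_\kappa V_m(\kappa, F);$$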
hence the argument will be completed by showing that

$$\sup_{F \in \mathcal{F}_\varepsilon} \inf_\kappa V_m(\kappa, F) = \frac{v^*(\varepsilon)}{1 - v^*(\varepsilon)/m}.$$

But we have already shown by (30) that

$$\inf_\kappa V_m(\kappa, F_\varepsilon) = \frac{v^*(\varepsilon)}{1 - v^*(\varepsilon)/m}.$$

For all but purists, this completes the proof of the saddlepoint relation

$$\sup_{F \in \mathcal{F}_\varepsilon} \inf_\kappa V_m(\kappa, F) = \frac{v^*(\varepsilon)}{1 - v^*(\varepsilon)/m} = \inf_\kappa V_m(\kappa, F_\varepsilon).$$

Purists who want everything stated using proper RV's will want the following spelled out. Let $G_{\varepsilon,\mu} = (1-\varepsilon)\Phi + \varepsilon H_\mu$. For $\eta > 0$, there is $\mu \in \mathbb{R}$ with

$$\inf_\kappa V_m(\kappa, G_{\varepsilon,\mu}) > \inf_\kappa V_m(\kappa, F_\varepsilon) - \eta.$$

Now note also that, for $\mu > \kappa$, the Huber $\psi_\kappa$ obeys

$$V(\kappa, G_{\varepsilon,\mu}) = V(\kappa, G_{\varepsilon,\infty}) = V(\kappa, F_\varepsilon) = \bar V(\kappa, \varepsilon) \ge v^*(\varepsilon),$$

with similar statements also being true for $A$ and $B$. This observation can be elaborated into a full proof, exploiting

$$V(\kappa, (1-\varepsilon)\Phi + \varepsilon N(\mu, \tau^2)) \to V(\kappa, F_\varepsilon)$$

as $\mu \to \infty$. We omit the details. □


Proof of Corollary 4.4

If $m \cdot I(F^*) < 1$, then $\bar V(\bar\kappa, \varepsilon)/m \ge v^*(\varepsilon)/m = 1/(m\,I(F^*)) > 1$, and so $\bar V(\bar\kappa(\kappa), \varepsilon)/m > 1$ for each $\kappa > 0$.

The variance map of the LFSE with parameters $(m, \kappa, \varepsilon)$ is affine:

$$\bar{\mathcal{T}}(\tau^2) = \frac{\bar V(\bar\kappa(\kappa), \varepsilon)}{m}\,(1 + \tau^2),$$

so both the slope and intercept equal $\bar V(\bar\kappa, \varepsilon)/m > 1$. Hence there is no fixed point, and in fact there is a strict vertical gap between the identity line and the graph of $\bar{\mathcal{T}}$, a gap of size $> 1$.

Now $\bar{\mathcal{T}}$ is the pointwise supremum of all the variance maps of proper state evolutions. Hence for any $\tau$ we choose, there is a variance map $\mathcal{T}$ of some proper state evolution lying above the diagonal line at $\tau^2$:

$$\mathcal{T}(\tau^2) > \tau^2,$$

which implies that the corresponding highest fixed point $\mathcal{T}(\tau_\infty^2) = \tau_\infty^2$ obeys $\tau_\infty^2 > \tau^2$. For all but purists, this completes the proof.

Purists will want to know that among the highest such fixed points are in fact unique fixed points, which then represent variances that are in fact achieved. We will show this for contaminated distributions of the form $G_{\varepsilon,\mu}$, for large $\mu$.
• For such $G_{\varepsilon,\mu}$, $\mu$ sufficiently large, we will show that the variance map $\mathcal{T}$ is star shaped; namely, defining $\Gamma = \Gamma(\tau^2; G_{\varepsilon,\mu})$ by

$$\mathcal{T}(\tau^2) = (1 + \tau^2) \cdot \Gamma(\tau^2),$$

we will show that for $\mu$ large, $\tau^2 \mapsto \Gamma(\tau^2)$ is a monotone nonincreasing function of $\tau^2$.

• Any such star-shaped map has a unique fixed point; if $\tau_1^2 < \tau_2^2$ are two distinct purported fixed points then, because the line $\tau^2 \mapsto (1 + \tau^2)\,\Gamma(\tau_1^2)$ has a unique fixed point at $\tau^2 = \tau_1^2$,

$$\tau^2 > (1 + \tau^2)\,\Gamma(\tau_1^2), \qquad \tau^2 > \tau_1^2. \tag{31}$$

Hence

$$(1 + \tau_2^2)\,\Gamma(\tau_2^2) \le (1 + \tau_2^2)\,\Gamma(\tau_1^2) = (1 + \tau^2)\,\Gamma(\tau_1^2)\big|_{\tau^2 = \tau_2^2} < \tau_2^2.$$

In the last step we use (31), evaluated at $\tau^2 = \tau_2^2$. The last display contradicts the supposed fixed-point nature of $\tau_2^2$ and proves that the second fixed point $\tau_2^2$ cannot exist.

To explain the star-shapedness, we need to develop some rescaling relationships. Let $S_a F$ denote the rescaling operator on CDF's, producing $(S_a F)(x) = F(x/a)$. For a given $F \in \mathcal{F}_\varepsilon$ and a given $\tau^2$ and associated $\sigma^2 = 1 + \tau^2$, let $F^\sigma = S_\sigma(F \star \Phi_\tau)$. We then have

$$F^\sigma = (1-\varepsilon)\Phi + \varepsilon H^\sigma,$$

where the contamination CDF is $H^\sigma = S_\sigma(H \star \Phi_\tau)$. Because of the scale invariance $\lambda = \kappa\sigma$,

$$\int \psi_{\kappa\sigma}^2(x)\,d(F \star \Phi_\tau)(x) = \sigma^2 \int \psi_\kappa^2(u)\,dF^\sigma(u). \tag{32}$$

Similarly,

$$\int \psi'_{\kappa\sigma}(x)\,d(F \star \Phi_\tau)(x) = \int \psi'_\kappa(u)\,dF^\sigma(u).$$

It follows that

$$\Gamma(\tau^2) = m \left(\frac{r}{1+r}\right)^2 A(\psi_{\bar\kappa}, F^\sigma),$$

where $\bar\kappa = \kappa(1+r)$ solves

$$\frac{r}{1+r}\,B(\psi_{\bar\kappa}, F^\sigma) = \frac{1}{m}.$$

The reader should check that the following claims, if established, would combine to prove the desired monotonicity of $\Gamma$.

• $A(\psi_{\kappa_0}, F^\sigma)$ is monotone decreasing in $\sigma$, for fixed $\kappa_0$.

• $B(\psi_{\kappa_0}, F^\sigma)$ is monotone increasing in $\sigma$, for fixed $\kappa_0$.

• $\kappa \mapsto A(\psi_\kappa, F)$ is increasing in $\kappa$.

• $\kappa \mapsto B(\psi_\kappa, F)$ is increasing in $\kappa$.

• $\sigma \mapsto r$ is monotone decreasing in $\sigma$.

• $\sigma \mapsto \bar\kappa$ is decreasing in $\sigma$.

Some of these are obvious, for example monotonicity of $\kappa \mapsto A(\psi_\kappa, F)$ and $\kappa \mapsto B(\psi_\kappa, F)$. Others follow from earlier items: monotonicity of $\sigma \mapsto \bar\kappa$ follows from that of $\sigma \mapsto r$, while monotonicity of $\sigma \mapsto r$ follows from the two earlier claims about $B$. Finally, the first two claims will be shown for $F = G_{\varepsilon,\mu}$ for all sufficiently large $\mu$.

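In the coming two paragraphs, let $\kappa$ be fixed independent of $\sigma$. Now of course

$$A(\psi_\kappa, F^\sigma) = \int \psi_\kappa^2(u)\,dF^\sigma(u) \tag{33}$$
$$= (1-\varepsilon)\int \psi_\kappa^2(u)\,d\Phi(u) + \varepsilon \int \psi_\kappa^2(u)\,dH^\sigma(u) \tag{34}$$
$$= I + II;$$

the term $I$ being independent of $\sigma$, we focus on the second one, $II$. Similarly,

$$B(\psi_\kappa, F^\sigma) = \int \psi'_\kappa(u)\,dF^\sigma(u) \tag{35}$$
$$= (1-\varepsilon)\int \psi'_\kappa(u)\,d\Phi(u) + \varepsilon \int \psi'_\kappa(u)\,dH^\sigma(u) \tag{36}$$
$$= III + IV.$$

We again focus on the $\sigma$-varying term; this time $IV$. Letting $H_\sigma = S_\sigma H$ we have $H^\sigma = \Phi_{\tau/\sigma} \star H_\sigma$. By associativity of convolution,

$$\int \psi_\kappa^2(u)\,dH^\sigma(u) = \int (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(u)\,dH_\sigma(u).$$

Similarly,

$$\int \psi'_\kappa(u)\,dH^\sigma(u) = \int (\psi'_\kappa \star \Phi_{\tau/\sigma})(u)\,dH_\sigma(u).$$

Now note that, for all sufficiently large $u$, $u \mapsto (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(u)$ is strictly monotone increasing. At the same time, again for all sufficiently large $u$, $\sigma \mapsto (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(u)$ is strictly monotone decreasing in $\sigma$. Also, let $H_\mu$ denote the CDF of a point mass at $\mu$; then $H_{\mu/\sigma} = S_\sigma H_\mu$. Consequently, $\sigma \mapsto S_\sigma H_\mu$ is increasingly concentrated (rather than spread) as $\sigma$ increases. It follows that, for large enough $\mu > 0$,

$$\sigma \mapsto \int (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(u)\,dH_{\mu/\sigma}(u) = (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(\mu/\sigma)$$

is monotone decreasing in $\sigma$. Similarly, for large enough $\mu > 0$,

$$\sigma \mapsto \int (\psi'_\kappa \star \Phi_{\tau/\sigma})(u)\,dH_{\mu/\sigma}(u) = (\psi'_\kappa \star \Phi_{\tau/\sigma})(\mu/\sigma)$$

is monotone increasing in $\sigma$.

Because

$$A(\psi_\kappa, H^\sigma) = (\psi_\kappa^2 \star \Phi_{\tau/\sigma})(\mu/\sigma)$$

and

$$B(\psi_\kappa, H^\sigma) = (\psi'_\kappa \star \Phi_{\tau/\sigma})(\mu/\sigma),$$

and because of the decompositions $I + II$ and $III + IV$, our claims about the behavior of the RHS's in these displays, for large $\mu$, imply the needed monotonicities of $\sigma \mapsto A(\psi_\kappa, F^\sigma)$ and $\sigma \mapsto B(\psi_\kappa, F^\sigma)$. □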


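Proof of Lemma 5.3

Putting $\bar\sigma(m, \kappa, \varepsilon) = \sqrt{1 + \bar\tau_\infty^2(m, \kappa, \varepsilon)}$, we note that

$$\bar\lambda(m, \kappa, \varepsilon) = \kappa \cdot \bar\sigma(m, \kappa, \varepsilon).$$

Now

$$\bar\sigma(m, \kappa, \varepsilon)^2 = 1 + \frac{\bar V(\bar\kappa, \varepsilon)/m}{1 - \bar V(\bar\kappa, \varepsilon)/m} = \frac{1}{1 - \bar V(\bar\kappa, \varepsilon)/m}.$$

By direct evaluation, the function $\bar\kappa \mapsto \bar V(\bar\kappa, \varepsilon)$ is at first strictly decreasing on $(0, \infty)$ to a minimum at the Huber minimax parameter $\bar\kappa^*(\varepsilon)$, after which it is strictly increasing, tending to infinity as $\bar\kappa \to \infty$.

Consequently, on the interval $\bar\kappa \in (\bar\kappa^*(\varepsilon), \infty)$, the function $\bar\kappa \mapsto \bar V(\bar\kappa, \varepsilon)$ is strictly increasing. On the interval $\mathcal{K}_+ = (\kappa^*(\varepsilon), \kappa^+(\varepsilon; m))$ the function $\kappa \mapsto \bar V(\bar\kappa(\kappa), \varepsilon)$ is likewise strictly increasing. Hence on $\mathcal{K}_+$, $\kappa \mapsto \bar\sigma$ is strictly increasing, and so also is $\kappa \cdot \bar\sigma$.

Fix $m_0 > \bar V(0, \varepsilon) = \frac{\pi}{2(1-\varepsilon)^2}$. For each $m > m_0$, $\bar\sigma(m, 0, \varepsilon) < \infty$, and this is the largest that $\bar\sigma(m, \kappa, \varepsilon)$ ever gets on $\kappa \in (0, \kappa^*(\varepsilon))$. On the interval $\mathcal{K}_- = (0, \kappa^*(\varepsilon))$ the function $\kappa \mapsto \bar V(\bar\kappa(\kappa), \varepsilon)$ is bounded and has bounded derivative. It follows that

$$\sup_{\kappa \in \mathcal{K}_-} |\bar\sigma(m, \kappa, \varepsilon) - 1| \to 0, \qquad m \to \infty;$$

and also

$$\sup_{\kappa \in \mathcal{K}_-} \left|\frac{\partial}{\partial\kappa}\bar\sigma(m, \kappa, \varepsilon) - 0\right| \to 0, \qquad m \to \infty,$$

together implying

$$\sup_{\kappa \in \mathcal{K}_-} \left|\frac{\partial}{\partial\kappa}\bar\lambda(m, \kappa, \varepsilon) - 1\right| \to 0, \qquad m \to \infty,$$

yielding $\frac{\partial}{\partial\kappa}\bar\lambda(m, \kappa, \varepsilon) > 0$ throughout $\mathcal{K}_-$ for all sufficiently large $m$.

We have shown that $\bar\lambda$ is strictly increasing, as a function of $\kappa$, throughout the whole domain $(0, \kappa^+(m, \varepsilon)) = \mathcal{K}_- \cup \mathcal{K}_+$. □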
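Proof of Theorem 5.4

We first remark that for any specific $\lambda > 0$,

$$V_m^{\circ}(\lambda, F_\varepsilon) = V_m(\kappa(m, \lambda, F_\varepsilon), F_\varepsilon)$$
$$= m \cdot \bar\tau_\infty^2(m, \kappa(m, \lambda, F_\varepsilon), F_\varepsilon)$$
$$\ge m \cdot \min_\kappa \bar\tau_\infty^2(m, \kappa, F_\varepsilon)$$
$$= m \cdot \bar\tau_\infty^2(m, \kappa^*(\varepsilon), \varepsilon)$$
$$= V_m(\kappa^*, F_\varepsilon) = V_m^*(\varepsilon),$$

and so

$$\inf_\lambda \sup_{F \in \mathcal{F}_\varepsilon} V_m^{\circ}(\lambda, F) \ge V_m^*(\varepsilon).$$

We complete the argument by showing that

$$\sup_{F \in \mathcal{F}_\varepsilon} V_m^{\circ}(\lambda^*, F) \le V_m^*(\varepsilon),$$

or in other words:

$$V_m^{\circ}(\lambda^*, F) \le V_m^*(\varepsilon), \qquad \forall F \in \mathcal{F}_\varepsilon.$$

To show this, we need merely to show that for each $(\kappa, F)$ yielding an instance where $\lambda_\infty(m, \kappa, F) = \lambda^*(m, \varepsilon)$, we have

$$\tau_\infty^2(m, \kappa, F) \le \bar\tau_\infty^2(m, \kappa^*(\varepsilon), \varepsilon), \tag{37}$$

since then

$$V_m^{\circ}(\lambda^*, F) = V_m(\kappa, F)$$
$$= m \cdot \tau_\infty^2(m, \kappa, F)$$
$$\le m \cdot \bar\tau_\infty^2(m, \kappa^*(\varepsilon), \varepsilon)$$
$$= V_m(\kappa^*, F_\varepsilon) = V_m^*(\varepsilon).$$

Suppose that $\kappa \ge \kappa^*(\varepsilon)$; then from

$$\kappa \cdot \sqrt{1 + \tau_\infty^2(m, \kappa, F)} = \lambda_\infty(m, \kappa, F)$$
$$= \lambda^*$$
$$= \kappa^*(\varepsilon) \cdot \sqrt{1 + \bar\tau_\infty^2(m, \kappa^*(\varepsilon), \varepsilon)}$$

we conclude that

$$\sqrt{1 + \tau_\infty^2(m, \kappa, F)} \cdot \frac{\kappa}{\kappa^*(\varepsilon)} = \sqrt{1 + \bar\tau_\infty^2(m, \kappa^*(\varepsilon), \varepsilon)},$$

and since $\frac{\kappa}{\kappa^*(\varepsilon)} \ge 1$, we indeed obtain (37).

To finish, we argue that $\kappa < \kappa^*(\varepsilon)$ can never arise in a pair $(\kappa, F)$ obeying $\lambda_\infty(m, \kappa, F) = \lambda^*$. By the monotonicity property of Lemma 5.3, if we have $\kappa < \kappa^*(\varepsilon)$,

$$\sup_{F \in \mathcal{F}_\varepsilon} \lambda_\infty(m, \kappa, F) = \bar\lambda(m, \kappa, \varepsilon)$$
$$< \bar\lambda(m, \kappa^*(\varepsilon), \varepsilon),$$

proving that it can never happen that $\kappa(m, \lambda^*, F) < \kappa^*(\varepsilon)$, for any $F \in \mathcal{F}_\varepsilon$. □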